Abstract The standard assumption for proving linear convergence of first
order methods for smooth convex optimization is the strong convexity of the
objective function, an assumption which does not hold for many practical
applications. In this paper, we derive linear convergence rates of several first
order methods for solving smooth non-strongly convex constrained optimization
problems, i.e. involving an objective function with a Lipschitz continuous
gradient that satisfies some relaxed strong convexity condition. In particular,
in the case of smooth constrained convex optimization, we provide several
relaxations of the strong convexity conditions and prove that they are sufficient
for getting linear convergence for several first order methods such as projected
gradient, fast gradient and feasible descent methods. We also provide examples
of functional classes that satisfy our proposed relaxations of strong convexity
conditions. Finally, we show that the proposed relaxed strong convexity
conditions cover important applications ranging from solving linear systems and
Linear Programming to dual formulations of linearly constrained convex problems.


1 Introduction 

Recently, there has been a surge of interest in accelerating first order methods
for difficult optimization problems, for example those without a strongly convex
objective function, arising in different applications such as data analysis [6] or
machine learning [?]. Algorithms based on gradient information have proved
to be effective in these settings, such as projected gradient and its fast variants,
stochastic gradient descent [?] or coordinate gradient descent [?].

For smooth convex programming, i.e. optimization problems with convex
objective function having a Lipschitz continuous gradient with constant $L_f > 0$,
first order methods converge sublinearly. In order to get an $\epsilon$-optimal
solution, we need to perform $\mathcal{O}(L_f/\epsilon)$ or even $\mathcal{O}(\sqrt{L_f/\epsilon})$ calls to the oracle [?].

Typically, for proving linear convergence of first order methods we also need
to require strong convexity of the objective function. Unfortunately, many
practical applications do not have a strongly convex objective function. A new
line of analysis that circumvents these difficulties was developed using several
notions. For example, a sharp minima type condition for non-strongly convex
optimization problems, i.e. the epigraph of the objective function is a polyhedron,
has been proposed in [?]. An error bound property, that estimates the
distance to the solution set from any feasible point by the norm of the proximal
residual, has been analyzed in [?]. Finally, a restricted (also called
essential) strong convexity inequality, which basically imposes a quadratic lower
bound on the objective function, has been derived in [?]. For all these
conditions (sharp minima, error bound or restricted strong convexity) several
gradient-type methods are shown to converge linearly, see e.g. [?].
Several other papers on linear convergence of first order methods for
non-strongly convex optimization have appeared recently [?]. The main goal of
this paper is to develop a framework for finding general functional conditions
for smooth convex constrained optimization problems that allow us to prove
linear convergence for a broad class of first order methods.

Contributions: For smooth convex constrained optimization, we show in this
paper that some relaxations of the strong convexity conditions of the objective
function are sufficient for obtaining linear convergence for several first order
methods. The most general relaxation we introduce is a quadratic functional
growth condition, which states that the objective function grows faster than
the squared distance between any feasible point and the optimal set. We also
propose other non-strongly convex conditions, which are more conservative
than the quadratic functional growth condition, and establish relations
between them. Further, we provide examples of functional classes that satisfy
our proposed relaxations of strong convexity conditions. For all these smooth
non-strongly convex constrained optimization problems, we prove that the
corresponding relaxations are sufficient for getting linear convergence for several
first order methods, such as projected gradient, fast gradient and feasible
descent methods. We also show that the corresponding linear rates are improved
in some cases compared to the existing results. We also establish necessary and
sufficient conditions for linear convergence of the gradient method. Finally, we
show that the proposed relaxed strong convexity conditions cover important
applications ranging from solving linear systems and Linear Programming to
dual formulations of linearly constrained convex problems.

Notations: We work in the space $\mathbb{R}^n$ composed of column vectors, and $\mathbb{R}^n_+$
denotes the non-negative orthant. For $u, v \in \mathbb{R}^n$ we denote the Euclidean
inner product $\langle u, v \rangle = u^T v$, the Euclidean norm $\|u\| = \sqrt{\langle u, u \rangle}$ and the projection
of $u$ onto a convex set $X$ as $[u]_X = \arg\min_{x \in X} \|x - u\|$. For a matrix $A \in \mathbb{R}^{m \times n}$
we denote by $\sigma_{\min}(A)$ the smallest nonzero singular value and by $\|A\|$ the spectral norm.


2 Problem formulation 


In this paper we consider the class of convex constrained optimization
problems:

$$ (P): \quad f^* = \min_{x \in X} f(x), $$

where $X \subseteq \mathbb{R}^n$ is a simple closed convex set, that is the projection onto this
set is easy, and $f : X \to \mathbb{R}$ is a closed convex function. We further denote
by $X^* = \arg\min_{x \in X} f(x)$ the set of optimal solutions of problem (P). We
assume throughout the paper that the optimal set $X^*$ is nonempty and closed
and the optimal value $f^*$ is finite. Moreover, in this paper we assume that the
objective function is smooth, that is $f$ has Lipschitz continuous gradient with
constant $L_f > 0$ on the set $X$:

$$ \|\nabla f(x) - \nabla f(y)\| \le L_f \|x - y\| \quad \forall x, y \in X. \tag{1} $$

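An immediate consequence of (1) is the following inequality [?]:

$$ f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L_f}{2}\|x - y\|^2 \quad \forall x, y \in X, \tag{2} $$

while, under convexity of $f$, we also have:

$$ 0 \le \langle \nabla f(x) - \nabla f(y), x - y \rangle \le L_f \|x - y\|^2 \quad \forall x, y \in X. \tag{3} $$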
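As a quick numerical sanity check of inequalities (2) and (3), the sketch below evaluates both on random pairs of points for a small least-squares objective. The matrix `A`, the vector `b` and all helper names are illustrative assumptions and do not come from the paper. The matrix is rank deficient on purpose, so the objective is smooth and convex but not strongly convex.

```python
import random

# Hypothetical data: f(x) = 0.5 * ||A x - b||^2 with a rank-deficient A.
A = [[1.0, 2.0], [2.0, 4.0]]
b = [1.0, 2.0]

# A^T A = [[5, 10], [10, 20]] has eigenvalues 0 and 25, so L_f = 25.
L_F = 25.0


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def residual(x):
    return [A[i][0] * x[0] + A[i][1] * x[1] - b[i] for i in range(2)]


def f(x):
    r = residual(x)
    return 0.5 * dot(r, r)


def grad(x):
    # The gradient of f is A^T (A x - b).
    r = residual(x)
    return [A[0][j] * r[0] + A[1][j] * r[1] for j in range(2)]


def check_smoothness(num_trials=1000, seed=0, tol=1e-9):
    """Return True if (2) and (3) hold on all sampled pairs (x, y)."""
    rng = random.Random(seed)
    for _ in range(num_trials):
        x = [rng.uniform(-5, 5), rng.uniform(-5, 5)]
        y = [rng.uniform(-5, 5), rng.uniform(-5, 5)]
        d = [y[0] - x[0], y[1] - x[1]]
        dist2 = dot(d, d)
        # Inequality (2): quadratic upper bound on f(y).
        if f(y) > f(x) + dot(grad(x), d) + 0.5 * L_F * dist2 + tol:
            return False
        # Inequality (3): monotone gradient, bounded by L_f * ||x - y||^2.
        gx, gy = grad(x), grad(y)
        inner = dot([gy[0] - gx[0], gy[1] - gx[1]], d)
        if inner < -tol or inner > L_F * dist2 + tol:
            return False
    return True
```

Calling `check_smoothness()` returns `True` for this objective. Replacing `L_F` by a value below 25 makes the check fail for generic samples, which is one way to see that the Lipschitz constant is tight here.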


It is well known that first order methods converge sublinearly on the class
of problems whose objective function $f$ has Lipschitz continuous gradient with
constant $L_f$ on the set $X$, e.g. convergence rates in terms of function values
of order [?]:

$$ f(x^k) - f^* \le \frac{L_f \|x^0 - x^*\|^2}{2k} \quad \text{for projected gradient}, $$
$$ f(x^k) - f^* \le \frac{2 L_f \|x^0 - x^*\|^2}{(k+1)^2} \quad \text{for fast gradient}, \tag{4} $$

where $x^k$ is the $k$th iterate generated by the method. Typically, in order to
show linear convergence of first order methods applied for solving smooth
convex problems, we need to require strong convexity of the objective function.
We recall that $f$ is a strongly convex function on the convex set $X$ with constant
$\sigma_f > 0$ if the following inequality holds [11]:

$$ f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y) - \frac{\sigma_f \alpha (1 - \alpha)}{2}\|x - y\|^2 \tag{5} $$

for all $x, y \in X$ and $\alpha \in [0, 1]$. Note that if $\sigma_f = 0$, then $f$ is simply a convex
function. We denote by $\mathcal{S}_{L_f,\sigma_f}(X)$ the class of $\sigma_f$-strongly convex functions
with an $L_f$-Lipschitz continuous gradient on $X$. First order methods converge
linearly on the class of problems (P) whose objective function $f$ is in
$\mathcal{S}_{L_f,\sigma_f}(X)$, with convergence rates of order [11]:


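$$ f(x^k) - f^* \le \frac{L_f \|x^0 - x^*\|^2}{2} \left(1 - \frac{\sigma_f}{L_f}\right)^k \quad \text{for projected gradient}, $$
$$ f(x^k) - f^* \le 2 \left(f(x^0) - f^*\right) \left(1 - \sqrt{\frac{\sigma_f}{L_f}}\right)^k \quad \text{for fast gradient}. \tag{6} $$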

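In the case of a differentiable function $f$ with $L_f$-Lipschitz continuous
gradient, each of the following conditions is equivalent to the inclusion
$f \in \mathcal{S}_{L_f,\sigma_f}(X)$ [?]:

$$ f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\sigma_f}{2}\|x - y\|^2 \quad \forall x, y \in X, $$
$$ \sigma_f \|x - y\|^2 \le \langle \nabla f(x) - \nabla f(y), x - y \rangle \quad \forall x, y \in X. \tag{7} $$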


Let us give some properties of smooth strongly convex functions from the class
$\mathcal{S}_{L_f,\sigma_f}(X)$. Firstly, using the optimality conditions for (P), that is
$\langle \nabla f(x^*), y - x^* \rangle \ge 0$ for all $y \in X$ and $x^* \in X^*$, in the first inequality in (7) we get the
following relation:

$$ f(x) - f^* \ge \frac{\sigma_f}{2}\|x - x^*\|^2 \quad \forall x \in X. \tag{8} $$

Further, the gradient mapping of a continuously differentiable function $f$ with
Lipschitz gradient at a point $x \in \mathbb{R}^n$ is defined as [?]:

$$ g(x) = L_f \left( x - [x - 1/L_f \nabla f(x)]_X \right). $$

If additionally the function $f$ has Lipschitz continuous gradient, then we
obtain a second relation valid for any $f \in \mathcal{S}_{L_f,\sigma_f}(X)$ [?, Lemma 22]:

$$ \frac{\sigma_f}{2}\|x - y\| \le \|g(x) - g(y)\| \quad \forall x, y \in X. \tag{9} $$

However, in many applications the strong convexity condition (5) or
equivalently one of the conditions (7) cannot be assumed to hold. Therefore, in the
next sections we introduce some non-strongly convex conditions for the
objective function $f$ that are less conservative than strong convexity. These are
based on relaxations of the strong convexity relations (7)-(9).


3 Non-strongly convex conditions for a function 

In this section we introduce several functional classes that relax the
strong convexity properties (7)-(9) of a function and derive relations between
these classes. More precisely, we observe that the strong convexity relations (7)
or (9) are valid for all $x, y \in X$. We propose in this paper functional classes
satisfying conditions of the form (7) or (9) that hold for some particular choices
of $x$ and $y$, or satisfying simply the condition (8).

3.1 Quasi-strong convexity

The first non-strongly convex relaxation we introduce is based on choosing a
particular value for $y$ in the first strong convexity inequality in (7), that is
$y = \bar{x} \equiv [x]_{X^*}$ (recall that $[x]_{X^*}$ denotes the projection of $x$ onto the optimal
set $X^*$ of convex problem (P)):

Definition 1 Continuously differentiable function $f$ is called quasi-strongly
convex on set $X$ if there exists a constant $\kappa_f > 0$ such that for any $x \in X$ and
$\bar{x} = [x]_{X^*}$ we have:

$$ f^* \ge f(x) + \langle \nabla f(x), \bar{x} - x \rangle + \frac{\kappa_f}{2}\|x - \bar{x}\|^2 \quad \forall x \in X. \tag{10} $$

Note that inequality (10) alone does not even imply convexity of function $f$.
Moreover, our definition of quasi-strongly convex functions does not ensure
uniqueness of the optimal solution of problem (P) and does not require $f$ to
have Lipschitz continuous gradient. We denote the class of convex functions
with Lipschitz continuous gradient with constant $L_f$ in (1) and satisfying
the quasi-strong convexity property with constant $\kappa_f$ in (10) by $q\mathcal{S}_{L_f,\kappa_f}(X)$.
Clearly, for strongly convex functions with constant $\kappa_f$, from the first condition
in (7) with $y = x^* \in X^*$, we observe that the following inclusion holds:

$$ \mathcal{S}_{L_f,\kappa_f}(X) \subseteq q\mathcal{S}_{L_f,\kappa_f}(X). \tag{11} $$

Moreover, combining the inequalities (2) and (10), we obtain that the condition
number of objective function $f \in q\mathcal{S}_{L_f,\kappa_f}(X)$, defined as $\mu_f = \kappa_f / L_f$, satisfies:

$$ 0 < \mu_f \le 1. \tag{12} $$
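One way to obtain (12) is sketched below, under the assumption that the smoothness inequality being combined with (10) is (2) evaluated at $y = \bar{x}$, where $f(\bar{x}) = f^*$. The point $x \in X \setminus X^*$ is arbitrary.

```latex
% Inequality (2) at y = \bar{x}, using f(\bar{x}) = f^*:
f^* \le f(x) + \langle \nabla f(x), \bar{x} - x \rangle + \frac{L_f}{2}\|x - \bar{x}\|^2 .
% Inequality (10):
f^* \ge f(x) + \langle \nabla f(x), \bar{x} - x \rangle + \frac{\kappa_f}{2}\|x - \bar{x}\|^2 .
% Subtracting the second from the first and dividing by \|x - \bar{x}\|^2 > 0:
\frac{\kappa_f}{2} \le \frac{L_f}{2}
\quad \Longrightarrow \quad
\mu_f = \frac{\kappa_f}{L_f} \le 1 .
```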

We will derive below other functional classes that are related to our newly
introduced class of quasi-strongly convex functions $q\mathcal{S}_{L_f,\kappa_f}(X)$.
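To make Definition 1 concrete, the sketch below checks inequality (10) numerically for the hypothetical objective $f(x) = \frac{1}{2}(x_1 + x_2 - 1)^2$ on $X = \mathbb{R}^2$. This function is convex with Lipschitz gradient but not strongly convex, since its optimal set is the whole line $x_1 + x_2 = 1$. The function names and the constant $\kappa_f = 2$ are assumptions made for this illustration. For this particular $f$, a direct computation shows that (10) holds with equality at $\kappa_f = 2$ and fails for any larger constant.

```python
import random


def f(x):
    return 0.5 * (x[0] + x[1] - 1.0) ** 2


def grad_f(x):
    r = x[0] + x[1] - 1.0
    return [r, r]


def project_onto_optimal_set(x):
    # X* = {x : x1 + x2 = 1}; orthogonal projection onto this line.
    r = x[0] + x[1] - 1.0
    return [x[0] - 0.5 * r, x[1] - 0.5 * r]


def quasi_strong_convexity_gap(x, kappa):
    """Return f* minus the right-hand side of (10) at x, with f* = 0.

    Inequality (10) holds at x exactly when the returned gap is >= 0.
    """
    xbar = project_onto_optimal_set(x)
    d = [xbar[0] - x[0], xbar[1] - x[1]]
    g = grad_f(x)
    rhs = f(x) + g[0] * d[0] + g[1] * d[1] + 0.5 * kappa * (d[0] ** 2 + d[1] ** 2)
    return 0.0 - rhs


def holds_on_samples(kappa, trials=1000, seed=1, tol=1e-9):
    """Check (10) with constant kappa on random points of [-10, 10]^2."""
    rng = random.Random(seed)
    return all(
        quasi_strong_convexity_gap(
            [rng.uniform(-10, 10), rng.uniform(-10, 10)], kappa
        ) >= -tol
        for _ in range(trials)
    )
```

With these definitions, `holds_on_samples(2.0)` succeeds while `holds_on_samples(2.5)` does not, matching the hand computation that the gap equals $(2 - \kappa)(x_1 + x_2 - 1)^2 / 4$.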


3.2 Quadratic under-approximation 

Let us define the class of functions satisfying a quadratic under-approximation
on the set $X$, obtained from relaxing the first inequality in (7) by choosing
$y = x$ and $x = \bar{x} \equiv [x]_{X^*}$:

Definition 2 Continuously differentiable function $f$ has a quadratic
under-approximation on $X$ if there exists a constant $\kappa_f > 0$ such that for any $x \in X$
and $\bar{x} = [x]_{X^*}$ we have:

$$ f(x) \ge f^* + \langle \nabla f(\bar{x}), x - \bar{x} \rangle + \frac{\kappa_f}{2}\|x - \bar{x}\|^2 \quad \forall x \in X. \tag{13} $$

We denote the class of convex functions with Lipschitz continuous gradient
and satisfying the quadratic under-approximation property (13) on $X$ by
$\mathcal{U}_{L_f,\kappa_f}(X)$. Then, we have the following inclusion:

Theorem 1 Inequality (10) implies inequality (13). Therefore, the following
inclusion holds:

$$ q\mathcal{S}_{L_f,\kappa_f}(X) \subseteq \mathcal{U}_{L_f,\kappa_f}(X). \tag{14} $$

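Proof Let $f \in q\mathcal{S}_{L_f,\kappa_f}(X)$. Since $f$ is a convex function, it satisfies the inequality
(13) with some constant $\kappa_f(0) \ge 0$, i.e.:

$$ f(x) \ge f^* + \langle \nabla f(\bar{x}), x - \bar{x} \rangle + \frac{\kappa_f(0)}{2}\|x - \bar{x}\|^2. \tag{15} $$

Using first order Taylor approximation in the integral form we have:

$$ \begin{aligned}
f(x) &= f(\bar{x}) + \int_0^1 \langle \nabla f(\bar{x} + \tau(x - \bar{x})), x - \bar{x} \rangle \, d\tau \\
&= f(\bar{x}) + \int_0^1 \frac{1}{\tau} \langle \nabla f(\bar{x} + \tau(x - \bar{x})), \tau(x - \bar{x}) \rangle \, d\tau \\
&\ge f(\bar{x}) + \int_0^1 \frac{1}{\tau} \left( f(\bar{x} + \tau(x - \bar{x})) - f(\bar{x}) + \frac{\kappa_f}{2}\|\tau(x - \bar{x})\|^2 \right) d\tau \quad \text{[(10) at } \bar{x} + \tau(x - \bar{x})\text{]} \\
&\ge f(\bar{x}) + \int_0^1 \frac{1}{\tau} \left( \langle \nabla f(\bar{x}), \tau(x - \bar{x}) \rangle + \frac{\kappa_f(0)}{2}\|\tau(x - \bar{x})\|^2 + \frac{\kappa_f}{2}\|\tau(x - \bar{x})\|^2 \right) d\tau \quad \text{[(15)]} \\
&= f(\bar{x}) + \langle \nabla f(\bar{x}), x - \bar{x} \rangle + \int_0^1 \tau \, \frac{\kappa_f(0) + \kappa_f}{2}\|x - \bar{x}\|^2 \, d\tau \\
&= f(\bar{x}) + \langle \nabla f(\bar{x}), x - \bar{x} \rangle + \frac{\kappa_f(0) + \kappa_f}{2} \cdot \frac{1}{2}\|x - \bar{x}\|^2.
\end{aligned} $$

If we denote $\kappa_f(1) = \frac{\kappa_f(0) + \kappa_f}{2}$, then we get that inequality (15) also holds
for $\kappa_f(1)$. Repeating the same argument as above for $f \in q\mathcal{S}_{L_f,\kappa_f}(X)$
satisfying (15) for $\kappa_f(1)$, we get that inequality (15) also holds for
$\kappa_f(2) = \frac{\kappa_f(1) + \kappa_f}{2} = \frac{\kappa_f(0) + 3\kappa_f}{4}$. Iterating this procedure we obtain that after $t$ steps:

$$ \kappa_f(t) = \frac{\kappa_f(0) + (2^t - 1)\kappa_f}{2^t} \to \kappa_f \quad \text{as } t \to \infty. $$

Since after any $t$ steps the inequality (15) holds with $\kappa_f(t)$, using continuity
of $\kappa_f(t)$ in (15) we obtain (13). This proves our statement. □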
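The bootstrapping step used in the proof can be illustrated numerically. The sketch below iterates the averaging rule $\kappa_f(t+1) = (\kappa_f(t) + \kappa_f)/2$ and compares it with the closed form $(\kappa_f(0) + (2^t - 1)\kappa_f)/2^t$. The function names and the sample values ($\kappa_f(0) = 0$, $\kappa_f = 3$) are assumptions chosen only for illustration.

```python
def kappa_sequence(kappa0, kappa, steps):
    """Iterate kappa(t+1) = (kappa(t) + kappa) / 2 starting from kappa0."""
    values = [kappa0]
    for _ in range(steps):
        values.append(0.5 * (values[-1] + kappa))
    return values


def kappa_closed_form(kappa0, kappa, t):
    """Closed form (kappa0 + (2^t - 1) * kappa) / 2^t of the recursion."""
    return (kappa0 + (2 ** t - 1) * kappa) / 2 ** t


if __name__ == "__main__":
    seq = kappa_sequence(0.0, 3.0, 10)
    for t, value in enumerate(seq):
        print(t, value, kappa_closed_form(0.0, 3.0, t))
```

The gap to $\kappa_f$ halves at every step, so starting from the trivial constant $\kappa_f(0) = 0$ given by convexity the sequence approaches $\kappa_f$ geometrically.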

Moreover, combining the inequalities (2) and (13), we obtain that the condition
number of objective function $f \in \mathcal{U}_{L_f,\kappa_f}(X)$, defined as $\mu_f = \kappa_f / L_f$, satisfies:

$$ 0 < \mu_f \le 1. \tag{16} $$

3.3 Quadratic gradient growth

Let us define the class of functions satisfying a bound on the variation of
gradients over the set $X$. It is obtained by relaxing the second inequality in
(7) by choosing $y = \bar{x} \equiv [x]_{X^*}$:

Definition 3 Continuously differentiable function $f$ has a quadratic gradient
growth on set $X$ if there exists a constant $\kappa_f > 0$ such that for any $x \in X$ and
$\bar{x} = [x]_{X^*}$ we have:

$$ \langle \nabla f(x) - \nabla f(\bar{x}), x - \bar{x} \rangle \ge \kappa_f \|x - \bar{x}\|^2 \quad \forall x \in X. \tag{17} $$

Now, let us denote the class of convex differentiable functions with Lipschitz
gradient and satisfying the quadratic gradient growth (17) by $\mathcal{G}_{L_f,\kappa_f}(X)$.
In [18] the authors analyzed a similar class of objective functions, but for
unconstrained optimization problems, that is $X = \mathbb{R}^n$, which was called
restricted strong convexity and was defined as: there exists a constant $\kappa_f > 0$
such that $\langle \nabla f(x), x - \bar{x} \rangle \ge \kappa_f \|x - \bar{x}\|^2$ for all $x \in \mathbb{R}^n$. An immediate
consequence of Theorem 1 is the following inclusion:

Theorem 2 Inequality (10) implies inequality (17). Therefore, the following
inclusion holds:

$$ q\mathcal{S}_{L_f,\kappa_f}(X) \subseteq \mathcal{G}_{L_f,\kappa_f}(X). \tag{18} $$

Proof If $f \in q\mathcal{S}_{L_f,\kappa_f}(X)$, then $f$ satisfies the inequality (10). From Theorem 1
we also have that $f$ satisfies inequality (13). By adding the two inequalities
(10) and (13) in $x$ we get:

$$ \langle \nabla f(x) - \nabla f(\bar{x}), x - \bar{x} \rangle \ge \kappa_f \|x - \bar{x}\|^2 \quad \forall x \in X, \tag{19} $$

which proves that inequality (17) holds. □

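We prove below that (10) or (13) alone, together with convexity of $f$, implies (17) with
constant $\kappa_f / 2$. Indeed, let us assume for example that (13) holds, then we have:

$$ \begin{aligned}
f(x) &\ge f^* + \langle \nabla f(\bar{x}), x - \bar{x} \rangle + \frac{\kappa_f}{2}\|x - \bar{x}\|^2 \\
&\ge f(x) + \langle \nabla f(x), \bar{x} - x \rangle + \langle \nabla f(\bar{x}), x - \bar{x} \rangle + \frac{\kappa_f}{2}\|x - \bar{x}\|^2 \\
&= f(x) + \langle \nabla f(\bar{x}) - \nabla f(x), x - \bar{x} \rangle + \frac{\kappa_f}{2}\|x - \bar{x}\|^2,
\end{aligned} $$

which leads to (17) with constant $\kappa_f / 2$. Combining the inequalities (3) and
(17), we obtain that the condition number of objective function $f \in \mathcal{G}_{L_f,\kappa_f}(X)$
satisfies:

$$ 0 < \mu_f \le 1. \tag{20} $$

Theorem 3 Inequality (17) implies inequality (13). Therefore, the following
inclusion holds:

$$ \mathcal{G}_{L_f,\kappa_f}(X) \subseteq \mathcal{U}_{L_f,\kappa_f}(X). \tag{21} $$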

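Proof Let $f \in \mathcal{G}_{L_f,\kappa_f}(X)$, then from first order Taylor approximation in the
integral form we get:

$$ \begin{aligned}
f(x) &= f(\bar{x}) + \int_0^1 \langle \nabla f(\bar{x} + t(x - \bar{x})), x - \bar{x} \rangle \, dt \\
&= f(\bar{x}) + \langle \nabla f(\bar{x}), x - \bar{x} \rangle + \int_0^1 \langle \nabla f(\bar{x} + t(x - \bar{x})) - \nabla f(\bar{x}), x - \bar{x} \rangle \, dt \\
&= f(\bar{x}) + \langle \nabla f(\bar{x}), x - \bar{x} \rangle + \int_0^1 \frac{1}{t} \langle \nabla f(\bar{x} + t(x - \bar{x})) - \nabla f(\bar{x}), t(x - \bar{x}) \rangle \, dt \\
&\ge f(\bar{x}) + \langle \nabla f(\bar{x}), x - \bar{x} \rangle + \int_0^1 \frac{1}{t} \kappa_f \|t(x - \bar{x})\|^2 \, dt \\
&= f(\bar{x}) + \langle \nabla f(\bar{x}), x - \bar{x} \rangle + \frac{\kappa_f}{2}\|x - \bar{x}\|^2,
\end{aligned} $$

where we used that $[\bar{x} + t(x - \bar{x})]_{X^*} = \bar{x}$ for any $t \in [0, 1]$. This chain of
inequalities proves that $f$ satisfies inequality (13) with the same constant $\kappa_f$. □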


3.4 Quadratic functional growth 

We further define the class of functions satisfying a quadratic functional growth
property on the set $X$. It shows that the objective function grows faster than
the squared distance between any feasible point and the optimal set. More
precisely, since $\langle \nabla f(x^*), y - x^* \rangle \ge 0$ for all $y \in X$ and $x^* \in X^*$, then using
this relation and choosing $y = x$ and $x = \bar{x} \equiv [x]_{X^*}$ in the first inequality in (7),
we get a relaxation of this strong convexity condition similar to inequality (8):

Definition 4 Continuously differentiable function $f$ has a quadratic
functional growth on $X$ if there exists a constant $\kappa_f > 0$ such that for any $x \in X$
and $\bar{x} = [x]_{X^*}$ we have:

$$ f(x) - f^* \ge \frac{\kappa_f}{2}\|x - \bar{x}\|^2 \quad \forall x \in X. \tag{22} $$

Since the above quadratic functional growth inequality is given in $\bar{x}$, this
does not mean that $f$ grows everywhere faster than the quadratic function
$\frac{\kappa_f}{2}\|x - \bar{x}\|^2$. We denote the class of convex differentiable functions with
Lipschitz continuous gradient and satisfying the quadratic functional growth (22)
by $\mathcal{F}_{L_f,\kappa_f}(X)$. We now derive inclusion relations between the functional classes
we have introduced so far:

Theorem 4 The following chain of implications is valid:

$$ (7) \Rightarrow (10) \Rightarrow (17) \Rightarrow (13) \Rightarrow (22). $$

Therefore, the following inclusions hold:

$$ \mathcal{S}_{L_f,\kappa_f}(X) \subseteq q\mathcal{S}_{L_f,\kappa_f}(X) \subseteq \mathcal{G}_{L_f,\kappa_f}(X) \subseteq \mathcal{U}_{L_f,\kappa_f}(X) \subseteq \mathcal{F}_{L_f,\kappa_f}(X). \tag{23} $$

Proof From the optimality conditions for problem (P) we have
$\langle \nabla f(\bar{x}), x - \bar{x} \rangle \ge 0$ for all $x \in X$. Then, for any objective function $f$ satisfying (13), i.e.
$f \in \mathcal{U}_{L_f,\kappa_f}(X)$, we also have (22). In conclusion, from previous derivations,
(11) and Theorems 2 and 3 we obtain our chain of inclusions. □

Let us define the condition number of objective function $f \in \mathcal{F}_{L_f,\kappa_f}(X)$ as
$\mu_f = \kappa_f / L_f$. If the feasible set $X$ is unbounded, then combining (2) with (22)
and considering $\|x - \bar{x}\| \to \infty$, we conclude that:

$$ 0 < \mu_f \le 1. \tag{24} $$

However, if the feasible set $X$ is bounded, we may have $\kappa_f \gg L_f$, provided
that $\|\nabla f(\bar{x})\|$ is large, and thus the condition number might be greater than 1:

$$ \mu_f \ge 1. \tag{25} $$

Moreover, from the inclusions given by Theorem 4 we conclude that:

$$ \mu_f(\mathcal{S}) \le \mu_f(q\mathcal{S}) \le \mu_f(\mathcal{G}) \le \mu_f(\mathcal{U}) \le \mu_f(\mathcal{F}). $$

Let us denote the projected gradient step from $x$ by:

$$ x^+ = [x - 1/L_f \nabla f(x)]_X, $$

and its projection onto the optimal set $X^*$ by $\bar{x}^+ = [x^+]_{X^*}$. Then, we will
show that if $x^+$ is closer to $X^*$ than $x$, then the objective function $f$ must
satisfy the quadratic functional growth (22):

Theorem 5 Let $f$ be a convex function with Lipschitz continuous gradient
with constant $L_f$. If there exists some positive constant $\beta < 1$ such that the
following inequality holds:

$$ \|x^+ - \bar{x}^+\| \le \beta \|x - \bar{x}\| \quad \forall x \in X, $$

then $f$ satisfies the quadratic functional growth (22) on $X$ with the constant
$\kappa_f = L_f (1 - \beta)^2$.

Proof On the one hand, from the triangle inequality for the projection we have:

$$ \|x - \bar{x}\| \le \|x - \bar{x}^+\| \le \|x - x^+\| + \|x^+ - \bar{x}^+\|. $$

Combining this relation with the condition from the theorem, that is
$\|x^+ - \bar{x}^+\| \le \beta \|x - \bar{x}\|$, we get:

$$ (1 - \beta)\|x - \bar{x}\| \le \|x - x^+\|. \tag{26} $$

On the other hand, we note that $x^+$ is the optimal solution of the problem:

$$ x^+ = \arg\min_{z \in X} \left[ f(x) + \langle \nabla f(x), z - x \rangle + \frac{L_f}{2}\|z - x\|^2 \right]. \tag{27} $$

From (2) we have:

$$ f(x^+) \le f(x) + \langle \nabla f(x), x^+ - x \rangle + \frac{L_f}{2}\|x^+ - x\|^2 $$

and combining with the optimality conditions of (27) in $x$, that is
$\langle \nabla f(x) + L_f(x^+ - x), x^+ - x \rangle \le 0$, we get the following decrease in terms of the objective
function:

$$ f(x^+) \le f(x) - \frac{L_f}{2}\|x^+ - x\|^2. \tag{28} $$

Finally, combining (26) with (28), and using $f(x^+) \ge f^*$, we get our statement. □
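A small numerical illustration of Theorem 5 is sketched below for the hypothetical objective $f(x) = \frac{1}{2}x_1^2 + 2(x_2 + x_3 - 1)^2$ on $X = \mathbb{R}^3$, which is not strongly convex because its Hessian has a zero eigenvalue. All names, the sampling box and the test function are assumptions made for this example. For this $f$ one has $L_f = 8$, and a hand computation gives the contraction factor $\beta = 7/8$, so the theorem predicts quadratic functional growth with $\kappa_f = L_f(1 - \beta)^2 = 1/8$.

```python
import math
import random

L_F = 8.0  # largest Hessian eigenvalue of the toy objective below


def f(x):
    r = x[1] + x[2] - 1.0
    return 0.5 * x[0] ** 2 + 2.0 * r ** 2


def grad_f(x):
    r = x[1] + x[2] - 1.0
    return [x[0], 4.0 * r, 4.0 * r]


def gradient_step(x):
    # X = R^3 here, so the projection onto X is the identity.
    g = grad_f(x)
    return [xi - gi / L_F for xi, gi in zip(x, g)]


def project_onto_optimal_set(x):
    # X* = {x : x1 = 0, x2 + x3 = 1}.
    r = x[1] + x[2] - 1.0
    return [0.0, x[1] - 0.5 * r, x[2] - 0.5 * r]


def dist_to_optimal_set(x):
    xbar = project_onto_optimal_set(x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, xbar)))


def sample_points(trials=2000, seed=0):
    rng = random.Random(seed)
    return [[rng.uniform(-10, 10) for _ in range(3)] for _ in range(trials)]


def max_contraction_ratio(points):
    """Largest observed ratio dist(x+, X*) / dist(x, X*) over the samples."""
    ratios = []
    for x in points:
        d = dist_to_optimal_set(x)
        if d > 0.0:
            ratios.append(dist_to_optimal_set(gradient_step(x)) / d)
    return max(ratios)


def growth_constant(beta):
    """Constant kappa_f = L_f * (1 - beta)^2 from Theorem 5."""
    return L_F * (1.0 - beta) ** 2


def growth_holds(points, kappa, tol=1e-9):
    """Check f(x) - f* >= kappa/2 * dist(x, X*)^2 on the samples (f* = 0)."""
    return all(
        f(x) >= 0.5 * kappa * dist_to_optimal_set(x) ** 2 - tol for x in points
    )
```

The constant $1/8$ delivered by the theorem is valid but conservative for this example, since a direct computation shows that the growth inequality holds here for every constant up to 1.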


3.5 Error bound property 

Let us recall the gradient mapping of a continuously differentiable function $f$
with Lipschitz continuous gradient at a point $x \in \mathbb{R}^n$: $g(x) = L_f(x - x^+)$,
where $x^+ = [x - 1/L_f \nabla f(x)]_X$ is the projected gradient step from $x$. Note
that $g(x^*) = 0$ for all $x^* \in X^*$. Moreover, if $X = \mathbb{R}^n$, then $g(x) = \nabla f(x)$.
Recall that the main property of the gradient mapping for convex objective
functions with Lipschitz continuous gradient of constant $L_f$ is given by the
following inequality [11, Theorem 2.2.7]:

$$ f(y) \ge f(x^+) + \langle g(x), y - x \rangle + \frac{1}{2L_f}\|g(x)\|^2 \quad \forall y \in X \text{ and } x \in \mathbb{R}^n. \tag{29} $$

Taking $y = \bar{x}$ in (29) and using that $f(x^+) \ge f^*$, we get the simpler inequality:

$$ \langle g(x), x - \bar{x} \rangle \ge \frac{1}{2L_f}\|g(x)\|^2 \quad \forall x \in \mathbb{R}^n. \tag{30} $$

In [8] Tseng introduced an error bound condition that estimates the distance
to the solution set from any feasible point by the norm of the proximal residual:
there exists a constant $\kappa > 0$ such that $\|x - \bar{x}\| \le \kappa \|x - [x - \nabla f(x)]_X\|$ for
all $x \in X$. This notion was further extended and analyzed in [?]. Next,
we define an error bound type condition, obtained from the relaxation of the
strong convexity inequality (9) for the particular choice $y = \bar{x} \equiv [x]_{X^*}$.

Definition 5 The continuously differentiable function $f$ has a global error bound
on $X$ if there exists a constant $\kappa_f > 0$ such that for any $x \in X$ and $\bar{x} = [x]_{X^*}$
we have:

$$ \|g(x)\| \ge \kappa_f \|x - \bar{x}\| \quad \forall x \in X. \tag{31} $$

We denote the class of convex functions with Lipschitz continuous gradient and
satisfying the error bound (31) by $\mathcal{E}_{L_f,\kappa_f}(X)$. Let us define the condition number
of the objective function $f \in \mathcal{E}_{L_f,\kappa_f}(X)$ as $\mu_f = \kappa_f / L_f$. Combining inequality
(30) and (31) we conclude that the condition number satisfies the inequality:

$$ 0 < \mu_f \le 2. \tag{32} $$

However, for the unconstrained case, i.e. $X = \mathbb{R}^n$ and $\nabla f(\bar{x}) = 0$, from (1) and
(31) we get $0 < \mu_f \le 1$. We now determine relations between the quadratic
functional growth condition and the error bound condition.

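Theorem 6 Inequality (31) implies inequality (22) with constant $\mu_f \cdot \kappa_f$.
Therefore, the following inclusion holds for the functional class $\mathcal{E}_{L_f,\kappa_f}(X)$:

$$ \mathcal{E}_{L_f,\kappa_f}(X) \subseteq \mathcal{F}_{L_f,\mu_f \cdot \kappa_f}(X). \tag{33} $$

Proof Combining (28) and (31) we obtain:

$$ \kappa_f^2 \|x - \bar{x}\|^2 \le \|g(x)\|^2 \le 2L_f \left(f(x) - f(x^+)\right) \le 2L_f \left(f(x) - f^*\right) \quad \forall x \in X. $$

In conclusion, inequality (22) holds with the constant $\mu_f \cdot \kappa_f$, where we
recall $\mu_f = \kappa_f / L_f$. This also proves the inclusion
$\mathcal{E}_{L_f,\kappa_f}(X) \subseteq \mathcal{F}_{L_f,\mu_f \cdot \kappa_f}(X)$. □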


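Theorem 7 Inequality (22) implies inequality (31) with constant
$\frac{1}{1 + \mu_f + \sqrt{1 + \mu_f}} \cdot \kappa_f$. Therefore, the following inclusion holds for the
functional class $\mathcal{F}_{L_f,\kappa_f}(X)$:

$$ \mathcal{F}_{L_f,\kappa_f}(X) \subseteq \mathcal{E}_{L_f, \frac{\kappa_f}{1 + \mu_f + \sqrt{1 + \mu_f}}}(X). \tag{34} $$

Proof From the gradient mapping property (29) evaluated at the point
$y = \bar{x}^+ \equiv [x^+]_{X^*}$, we get:

$$ \begin{aligned}
f^* &\ge f(x^+) + \langle g(x), \bar{x}^+ - x \rangle + \frac{1}{2L_f}\|g(x)\|^2 \\
&= f(x^+) + \langle g(x), \bar{x}^+ - x^+ \rangle - \frac{1}{2L_f}\|g(x)\|^2.
\end{aligned} $$

Further, combining the previous inequality and (22) we obtain:

$$ \langle g(x), x^+ - \bar{x}^+ \rangle + \frac{1}{2L_f}\|g(x)\|^2 \ge f(x^+) - f^* \ge \frac{\kappa_f}{2}\|x^+ - \bar{x}^+\|^2. $$

Using the Cauchy-Schwarz inequality for the scalar product and then rearranging
the terms we obtain:

$$ \frac{1}{2L_f} \left( \|g(x)\| + L_f \|x^+ - \bar{x}^+\| \right)^2 \ge \frac{\kappa_f + L_f}{2}\|x^+ - \bar{x}^+\|^2 $$

or equivalently

$$ \|g(x)\| + L_f \|x^+ - \bar{x}^+\| \ge \sqrt{L_f(\kappa_f + L_f)} \, \|x^+ - \bar{x}^+\|. $$

We conclude that:

$$ \|g(x)\| \ge \left( \sqrt{L_f(\kappa_f + L_f)} - L_f \right) \|x^+ - \bar{x}^+\|. $$

Since

$$ \|x - \bar{x}\| \le \|x - \bar{x}^+\| \le \|x - x^+\| + \|x^+ - \bar{x}^+\| = \frac{1}{L_f}\|g(x)\| + \|x^+ - \bar{x}^+\|, $$

then we obtain:

$$ \|g(x)\| \ge \left( \sqrt{L_f(\kappa_f + L_f)} - L_f \right) \left( \|x - \bar{x}\| - \frac{1}{L_f}\|g(x)\| \right). $$

After simple manipulations and using that $\mu_f = \kappa_f / L_f$ we arrive at:

$$ \|g(x)\| \ge \frac{\kappa_f}{1 + \mu_f + \sqrt{1 + \mu_f}} \|x - \bar{x}\|, $$

which shows that inequality (31) is valid for the constant
$\frac{\kappa_f}{1 + \mu_f + \sqrt{1 + \mu_f}}$. □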
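The "simple manipulations" closing the proof can be spelled out as follows. The shorthands $c = \sqrt{L_f(\kappa_f + L_f)} - L_f$ and $d = \|x - \bar{x}\|$ are introduced only for this sketch and do not appear in the original text.

```latex
% With \mu_f = \kappa_f / L_f the coefficient becomes
c = \sqrt{L_f(\kappa_f + L_f)} - L_f = L_f\left(\sqrt{1 + \mu_f} - 1\right).
% The last inequality of the proof reads \|g(x)\| \ge c\,(d - \|g(x)\|/L_f), hence
\|g(x)\|\left(1 + \frac{c}{L_f}\right) \ge c\, d
\quad \Longleftrightarrow \quad
\|g(x)\| \ge \frac{L_f\left(\sqrt{1 + \mu_f} - 1\right)}{\sqrt{1 + \mu_f}}\, d .
% Multiplying numerator and denominator by \sqrt{1 + \mu_f} + 1:
\frac{\sqrt{1 + \mu_f} - 1}{\sqrt{1 + \mu_f}}
= \frac{\mu_f}{\sqrt{1 + \mu_f}\left(\sqrt{1 + \mu_f} + 1\right)}
= \frac{\mu_f}{1 + \mu_f + \sqrt{1 + \mu_f}} ,
% and since L_f \mu_f = \kappa_f:
\|g(x)\| \ge \frac{\kappa_f}{1 + \mu_f + \sqrt{1 + \mu_f}}\, \|x - \bar{x}\| .
```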


Note that the functional classes we have introduced previously were obtained
by relaxing the strong convexity inequalities (7)-(9) for some particular choices
of $x$ and $y$. The reader can find other favorable examples of relaxations of
strong convexity inequalities and we believe that this paper opens a window
of opportunity for algorithmic research in non-strongly convex optimization
settings. In the next section we provide concrete examples of objective
functions that can be found in the functional classes introduced above.


4 Functional classes in $q\mathcal{S}_{L_f,\kappa_f}(X)$, $\mathcal{G}_{L_f,\kappa_f}(X)$ and $\mathcal{F}_{L_f,\kappa_f}(X)$

We now provide examples of structured convex optimization problems whose
objective function satisfies one of the relaxations of strong convexity conditions
that we have introduced in the previous sections. We start by recalling some
error bounds for the solutions of a system of linear equalities and inequalities.
Let $A \in \mathbb{R}^{p \times n}$, $C \in \mathbb{R}^{m \times n}$ and let $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$ be arbitrary norms in $\mathbb{R}^{m+p}$
and $\mathbb{R}^n$. Given the nonempty polyhedron:

$$ \mathcal{P} = \{x \in \mathbb{R}^n : Ax = b, \ Cx \le d\}, $$

there exists a constant $\theta(A, C) > 0$ such that the Hoffman inequality holds
(for a proof of the Hoffman inequality see e.g. [?]):

$$ \|x - \bar{x}\|_\beta \le \theta(A, C) \left\| \begin{bmatrix} Ax - b \\ [Cx - d]_+ \end{bmatrix} \right\|_\alpha \quad \forall x \in \mathbb{R}^n, $$

where $\bar{x} = [x]_{\mathcal{P}} \equiv \arg\min_{z \in \mathcal{P}} \|z - x\|_\beta$. The constant $\theta(A, C)$ is the Hoffman
constant for the polyhedron $\mathcal{P}$ with respect to the pair of norms $(\|\cdot\|_\alpha, \|\cdot\|_\beta)$.
In [3], the authors provide several estimates for the Hoffman constant. Assume
that $A$ has full row rank and define the following quantity:

$$ \zeta_{\alpha,\beta}(A, C) := \min_{I \in \mathcal{J}} \min_{u, v} \left\{ \|A^T u + C^T v\|_{\beta^*} : \left\| \begin{bmatrix} u \\ v \end{bmatrix} \right\|_{\alpha^*} = 1, \ v_I \ge 0, \ v_{[m] \setminus I} = 0 \right\}, $$


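where $\mathcal{J} = \{I \in 2^{[m]} : \operatorname{card} I = r - p, \ \operatorname{rank}[A^T, C_I^T] = r\}$ and
$r = \operatorname{rank}[A^T, C^T]$. An alternative formulation of the above quantity is:

$$ \frac{1}{\zeta_{\alpha,\beta}(A, C)} = \sup_{u, v} \left\{ \left\| \begin{bmatrix} u \\ v \end{bmatrix} \right\|_{\alpha^*} : \begin{array}{l} \|A^T u + C^T v\|_{\beta^*} = 1, \text{ rows of } C \\ \text{corresponding to nonzero components of } v \\ \text{and rows of } A \text{ are linearly independent} \end{array} \right\}. \tag{35} $$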


In [3] it was proved that $\zeta_{\alpha,\beta}(A, C)^{-1}$, where $\zeta_{\alpha,\beta}(A, C)$ is defined in (35), is
the Hoffman constant for the polyhedral set $\mathcal{P}$ w.r.t. the norms $(\|\cdot\|_\alpha, \|\cdot\|_\beta)$.
Considering the Euclidean setting ($\alpha = \beta = 2$) and the above assumptions,
then from the previous discussion we have:

$$ \theta(A, C) = \max_{I \in \mathcal{J}} \frac{1}{\sigma_{\min}\left([A^T, C_I^T]^T\right)}. $$

Under some regularity condition we can state a simpler form for $\zeta_{2,2}(A, C)$.
Assume that $A$ has full row rank and that the set $\{h \in \mathbb{R}^n : Ah = 0, \ Ch < 0\} \ne \emptyset$,
then we have [?]:

$$ \zeta_{2,2}(A, C) := \min_{u, v} \left\{ \|A^T u + C^T v\|_2 : \left\| \begin{bmatrix} u \\ v \end{bmatrix} \right\|_2 = 1, \ v \ge 0 \right\}. \tag{36} $$

Thus, for the special case $m = 0$, i.e. there are no inequalities, we have
$\zeta_{2,2}(A, 0) = \sigma_{\min}(A)$, where $\sigma_{\min}(A)$ denotes the smallest nonzero singular
value of $A$, and the Hoffman constant is^1:

$$ \theta(A, 0) = \frac{1}{\sigma_{\min}(A)}. \tag{37} $$
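For a single linear equation the Hoffman bound with $\theta(A, 0) = 1/\sigma_{\min}(A)$ can be checked by hand, and the sketch below does so numerically. The row vector `a`, the right-hand side `b` and the helper names are assumptions chosen for illustration. Here $A = a^T$ is a $1 \times 3$ matrix, so $\sigma_{\min}(A) = \|a\| = 3$, and the bound holds with equality.

```python
import math
import random

a = [1.0, 2.0, 2.0]  # single row of A, so sigma_min(A) = ||a|| = 3
b = 3.0              # the polyhedron is P = {x : a^T x = b}


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def project_onto_hyperplane(x):
    """Euclidean projection of x onto P = {z : a^T z = b}."""
    s = (dot(a, x) - b) / dot(a, a)
    return [xi - s * ai for xi, ai in zip(x, a)]


def distance_to_hyperplane(x):
    p = project_onto_hyperplane(x)
    return math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, p)))


def hoffman_bound(x):
    """Right-hand side theta(A, 0) * ||A x - b|| with theta = 1 / sigma_min(A)."""
    sigma_min = math.sqrt(dot(a, a))
    return abs(dot(a, x) - b) / sigma_min


def bound_holds(trials=1000, seed=0, tol=1e-9):
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-10, 10) for _ in range(3)]
        if distance_to_hyperplane(x) > hoffman_bound(x) + tol:
            return False
    return True
```

With several equality rows the same bound still holds, but equality is in general lost, because the residual $Ax - b$ need not align with the singular vector associated with $\sigma_{\min}(A)$.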


^1 This result can also be proved using simple algebraic arguments. More precisely, from
the Courant-Fischer theorem we know that $\|Ax\| \ge \sigma_{\min}(A)\|x\|$ for all $x \in \operatorname{Im}(A^T)$. Since we
assume that our polyhedron $\mathcal{P} = \{x : Ax = b\}$ is non-empty, then $x - [x]_{\mathcal{P}} \in \operatorname{Im}(A^T)$ for
all $x \in \mathbb{R}^n$ (from the KKT conditions of $\min_{z : Az = b} \|x - z\|^2$ we have that there exists $\mu$ such
that $x - [x]_{\mathcal{P}} + A^T \mu = 0$). In conclusion, we get:

$$ \|Ax - b\| = \|Ax - A[x]_{\mathcal{P}}\| \ge \sigma_{\min}(A)\|x - [x]_{\mathcal{P}}\| = \sigma_{\min}(A) \operatorname{dist}_2(x, \mathcal{P}) \quad \forall x \in \mathbb{R}^n. $$

4.1 Composition of strongly convex function with linear map is in $q\mathcal{S}_{L_f,\kappa_f}(X)$

Let us consider the class of optimization problems (P) having the following
structured form:

$$ f^* = \min_x f(x) \equiv g(Ax) \quad \text{s.t.:} \ x \in X = \{x \in \mathbb{R}^n : Cx \le d\}, \tag{38} $$

i.e. the objective function is in the form $f(x) = g(Ax)$, where $g$ is a smooth and
strongly convex function and $A \in \mathbb{R}^{m \times n}$ is a nonzero general matrix. Problems
of this form arise in various applications including dual formulations of linearly
constrained convex problems, convex quadratic problems, routing problems in
data networks, statistical regression and many others. Note that if $A$ has full
column rank, then $g(Ax)$ is a strongly convex function. However, if $A$ is rank
deficient, then $g(Ax)$ is not strongly convex. We prove in the next theorem
that the objective function of problem (38) belongs to the class $q\mathcal{S}_{L_f,\kappa_f}(X)$.

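Theorem 8 Let $X = \{x \in \mathbb{R}^n : Cx \le d\}$ be a polyhedral set, function
$g : \mathbb{R}^m \to \mathbb{R}$ be $\sigma_g$-strongly convex with $L_g$-Lipschitz continuous
gradient on $X$, and $A \in \mathbb{R}^{m \times n}$ be a nonzero matrix. Then, the convex function
$f(x) = g(Ax)$ belongs to the class $q\mathcal{S}_{L_f,\kappa_f}(X)$, with constants $L_f = L_g \|A\|^2$
and $\kappa_f = \frac{\sigma_g}{\theta^2(A, C)}$, where $\theta(A, C)$ is the Hoffman constant for the polyhedral
optimal set $X^*$.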

Proof The fact that $f$ has Lipschitz continuous gradient follows immediately
from the definition (1). Indeed,

$$ \begin{aligned}
\|\nabla f(x) - \nabla f(y)\| &= \|A^T \nabla g(Ax) - A^T \nabla g(Ay)\| \le \|A\| \|\nabla g(Ax) - \nabla g(Ay)\| \\
&\le \|A\| L_g \|Ax - Ay\| \le \|A\|^2 L_g \|x - y\|.
\end{aligned} $$

Thus, $L_f = L_g \|A\|^2$. Further, under the assumptions of the theorem, there exists
a unique pair $(t^*, T^*) \in \mathbb{R}^m \times \mathbb{R}^n$ such that the following relations hold:

$$ Ax^* = t^*, \quad \nabla f(x^*) = T^* \quad \forall x^* \in X^*. \tag{39} $$

For completeness, we give a short proof of this well known fact (see also [?]):
let $x_1^*, x_2^*$ be two optimal points for the optimization problem (38). Then, from
convexity of $f$ and the definition of optimal points, it follows that:

$$ f\left( \frac{x_1^* + x_2^*}{2} \right) = \frac{f(x_1^*) + f(x_2^*)}{2}. $$

Since $f(x) = g(Ax)$ we get from the previous relation that:

$$ g\left( \frac{Ax_1^* + Ax_2^*}{2} \right) = \frac{g(Ax_1^*) + g(Ax_2^*)}{2}. $$

On the other hand, using the definition of strong convexity for $g$ we have:

$$ g\left( \frac{Ax_1^* + Ax_2^*}{2} \right) \le \frac{g(Ax_1^*) + g(Ax_2^*)}{2} - \frac{\sigma_g}{8}\|Ax_1^* - Ax_2^*\|^2. $$

Combining the previous two relations, we obtain that $Ax_1^* = Ax_2^*$. Moreover,
$\nabla f(x^*) = A^T \nabla g(Ax^*)$. In conclusion, $Ax$ and the gradient of $f$ are constant
over the set of optimal solutions $X^*$ for (38), i.e. the relations (39) hold.
Moreover, we have that $f^* = f(x^*) = g(Ax^*) = g(t^*)$ for all $x^* \in X^*$.
In conclusion, the set of optimal solutions $X^*$ is described by the following
polyhedral set:

$$ X^* = \{x^* : Ax^* = t^*, \ Cx^* \le d\}. $$

Since we assume that our optimization problem (P) has at least one solution,
i.e. the optimal polyhedral set $X^*$ is non-empty, then from the Hoffman inequality
we have that there exists some positive constant depending on the matrices $A$
and $C$ describing the polyhedral set $X^*$, i.e. $\theta(A, C) > 0$, such that:

$$ \|x - \bar{x}\| \le \theta(A, C) \left\| \begin{bmatrix} Ax - t^* \\ [Cx - d]_+ \end{bmatrix} \right\| \quad \forall x \in \mathbb{R}^n, $$

where $\bar{x} = [x]_{X^*}$ (the projection of the vector $x$ onto the optimal set $X^*$).
Then, for any feasible $x$, i.e. $x$ satisfying $Cx \le d$, we have:

$$ \|x - \bar{x}\| \le \theta(A, C)\|Ax - A\bar{x}\| \quad \forall x \in X. $$

On the other hand, since $g$ is strongly convex, it follows that:

$$ g(A\bar{x}) \ge g(Ax) + \langle \nabla g(Ax), A\bar{x} - Ax \rangle + \frac{\sigma_g}{2}\|Ax - A\bar{x}\|^2. $$

Combining the previous two relations and keeping in mind that $f(x) = g(Ax)$
and $\nabla f(x) = A^T \nabla g(Ax)$, we obtain:

$$ f^* \ge f(x) + \langle \nabla f(x), \bar{x} - x \rangle + \frac{\sigma_g}{2\theta^2(A, C)}\|x - \bar{x}\|^2 \quad \forall x \in X, $$

which proves that the quasi-strong convexity inequality (10) holds for the
constant $\kappa_f = \sigma_g / \theta^2(A, C)$. □

Note that we can relax the requirements for $g$ in Theorem 8. For example,
we can replace the strong convexity assumption on $g$ with the conditions that
$g$ has a unique minimizer $t^*$ and it satisfies the quasi-strong convexity condition
(10) with constant $\kappa_g > 0$. Then, using the same arguments as in the proof of
Theorem 8 we can show that for objective functions $f(x) = g(Ax)$ of problem
(P), the optimal set is $X^* = \{x^* : Ax^* = t^*, \ Cx^* \le d\}$ and $f$ satisfies (10)
with constant $\kappa_f = \frac{\kappa_g}{\theta^2(A, C)}$, provided that the corresponding optimal set $X^*$
is nonempty.

Moreover, in the unconstrained case, that is $X = \mathbb{R}^n$, and for objective function
$f(x) = g(Ax)$, we get from (37) the following expression for the quasi-strong
convexity constant:

$$ \kappa_f = \sigma_g \sigma_{\min}^2(A). \tag{40} $$

Below we prove two extensions that belong to other functional classes we have 
introduced in this paper. 


4.2 Composition of strongly convex function with linear map plus a linear
term for $X = \mathbb{R}^n$ is in $\mathcal{G}_{L_f,\kappa_f}(X)$

Let us now consider the class of unconstrained optimization problems (P), i.e.
$X = \mathbb{R}^n$, having the form:

$$ f^* = \min_{x \in \mathbb{R}^n} f(x) \equiv g(Ax) + c^T x, \tag{41} $$

i.e. the objective function is in the form $f(x) = g(Ax) + c^T x$, where $g$ is a
smooth and strongly convex function, $A \in \mathbb{R}^{m \times n}$ is a nonzero general matrix
and $c \in \mathbb{R}^n$. We prove in the next theorem that this type of objective function
for problem (41) belongs to the class $\mathcal{G}_{L_f,\kappa_f}$.

Theorem 9 Under the same assumptions as in Theorem 8 with $X = \mathbb{R}^n$,
the objective function of the form $f(x) = g(Ax) + c^T x$ belongs to the class
$\mathcal{G}_{L_f,\kappa_f}(X)$, with constants $L_f = L_g \|A\|^2$ and $\kappa_f = \frac{\sigma_g}{\theta^2(A, 0)}$, where $\theta(A, 0)$ is
the Hoffman constant for the optimal set $X^*$.

Proof Since $g$ is $\sigma_g$-strongly convex and with $L_g$-Lipschitz continuous gradient,
then by the same reasoning as in the proof of Theorem 8 we get that there
exists a unique vector $t^*$ such that $Ax^* = t^*$ for all $x^* \in X^*$. Similarly, there
exists a unique scalar $s^*$ such that $c^T x^* = s^*$ for all $x^* \in X^*$. Indeed, for
$x_1^*, x_2^* \in X^*$ we have:

$$ f^* = g(t^*) + c^T x_1^* = g(t^*) + c^T x_2^*, $$

which implies that $c^T x_1^* = c^T x_2^*$. On the other hand, since problem (P) is
unconstrained, for any $x^* \in X^*$ we have:

$$ 0 = \nabla f(x^*) = A^T \nabla g(t^*) + c, $$

which implies that $c^T x^* = -(\nabla g(t^*))^T A x^* = -(\nabla g(t^*))^T t^*$. Therefore, the set
of optimal solutions $X^*$ is described in this case by the following polyhedron:

$$ X^* = \{x^* : Ax^* = t^*\}. $$

Then, there exists $\theta(A, 0) > 0$ such that the Hoffman inequality holds:

$$ \|x - \bar{x}\| \le \theta(A, 0)\|Ax - A\bar{x}\| \quad \forall x \in \mathbb{R}^n. $$

From the previous inequality and strong convexity of $g$, we have:

$$ \begin{aligned}
\frac{\sigma_g}{\theta^2(A, 0)}\|x - \bar{x}\|^2 &\le \sigma_g \|Ax - A\bar{x}\|^2 \le \langle \nabla g(Ax) - \nabla g(A\bar{x}), Ax - A\bar{x} \rangle \\
&= \langle A^T \nabla g(Ax) + c - A^T \nabla g(A\bar{x}) - c, x - \bar{x} \rangle \\
&= \langle \nabla f(x) - \nabla f(\bar{x}), x - \bar{x} \rangle.
\end{aligned} $$

Finally, we conclude that the inequality on the variation of gradients (17) holds
with constant $\kappa_f = \frac{\sigma_g}{\theta^2(A, 0)}$. □


4.3 Composition of strongly convex function with linear map plus a linear
term is in $\mathcal{F}_{L_f,\kappa_f}$


Finally, let us now consider the class of optimization problems (P) of the form:

$$f^* = \min_{x} f(x) \quad \big(= g(Ax) + c^T x\big)$$ (42)
$$\text{s.t.:} \;\; x \in X = \{x \in \mathbb{R}^n : Cx \le d\},$$

i.e. the objective function is of the form $f(x) = g(Ax) + c^T x$, where $g$ is a
smooth and strongly convex function, $A \in \mathbb{R}^{m \times n}$ is a nonzero matrix and
$c \in \mathbb{R}^n$. We now prove that the objective function of problem (42) belongs to
class $\mathcal{F}_{L_f,\kappa_f}$ provided that some boundedness assumption is imposed on $f$.

Theorem 10 Under the same assumptions as in Theorem 8, the objective
function $f(x) = g(Ax) + c^T x$ belongs to the class $\mathcal{F}_{L_f,\kappa_f}(X_M)$ for any constant
$M > 0$, where $X_M = \{x : x \in X, \; f(x) - f^* \le M\}$, with constants
$L_f = L_g\|A\|^2$ and $\kappa_f = \frac{\sigma_g}{\theta^2(A,c,C)\,(1 + M\sigma_g + 2c_g^2)}$, where $\theta(A,c,C)$ is the Hoffman
constant for the polyhedral optimal set $X^*$ and $c_g = \|\nabla g(Ax^*)\|$, with $x^* \in X^*$.

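Proof From the proof of Theorem 9 it follows that there exist unique $t^*$ and
$s^*$ such that the optimal set of (42) is given as follows:

$$X^* = \{x^* : Ax^* = t^*, \; c^T x^* = s^*, \; Cx^* \le d\}.$$

From Hoffman inequality we have that there exists some positive constant
depending on the matrices $A$, $C$ and $c$ describing the polyhedral set $X^*$, i.e.
$\theta(A,c,C) > 0$, such that:

$$\|x - \bar{x}\| \le \theta(A,c,C) \left\| \begin{bmatrix} Ax - t^* \\ c^T x - s^* \\ [Cx - d]_+ \end{bmatrix} \right\| \quad \forall x \in \mathbb{R}^n,$$

where recall that $\bar{x} = [x]_{X^*}$. Then, for any feasible $x$, i.e. satisfying $Cx \le d$,
we have:

$$\|x - \bar{x}\|^2 \le \theta^2(A,c,C)\left(\|Ax - A\bar{x}\|^2 + (c^T x - c^T\bar{x})^2\right) \quad \forall x \in X.$$ (43)

Since $f(x) = g(Ax) + c^T x$ and $g$ is strongly convex, it follows from (7) that:

$$g(Ax) - g(A\bar{x}) \ge \langle \nabla g(A\bar{x}), Ax - A\bar{x}\rangle + \frac{\sigma_g}{2}\|Ax - A\bar{x}\|^2$$
$$= \langle A^T\nabla g(A\bar{x}) + c, \; x - \bar{x}\rangle - \langle c, x - \bar{x}\rangle + \frac{\sigma_g}{2}\|Ax - A\bar{x}\|^2$$
$$= \langle \nabla f(\bar{x}), x - \bar{x}\rangle - \langle c, x - \bar{x}\rangle + \frac{\sigma_g}{2}\|Ax - A\bar{x}\|^2.$$

Using that $\langle \nabla f(\bar{x}), x - \bar{x}\rangle \ge 0$ for all $x \in X$, and the definition of $f$, we obtain:

$$f(x) - f^* \ge \frac{\sigma_g}{2}\|Ax - A\bar{x}\|^2 \quad \forall x \in X.$$ (44)

It remains to bound $(c^T x - c^T\bar{x})^2$. It is easy to notice that $\theta(A,c,C) \ge 1/\|c\|$.
We also observe that:

$$c^T x - c^T\bar{x} = \langle \nabla f(\bar{x}), x - \bar{x}\rangle - \langle \nabla g(A\bar{x}), Ax - A\bar{x}\rangle.$$

Since $f(x) - f^* \ge \langle \nabla f(\bar{x}), x - \bar{x}\rangle \ge 0$ for all $x \in X$, then we obtain:

$$|c^T x - c^T\bar{x}| \le f(x) - f^* + \|\nabla g(A\bar{x})\| \, \|Ax - A\bar{x}\|,$$

and then using the inequality $(\alpha + \beta)^2 \le 2\alpha^2 + 2\beta^2$ and considering $f(x) - f^* \le M$,
$c_g = \|\nabla g(t^*)\|$ and (44), we get:

$$(c^T x - c^T\bar{x})^2 \le 2(f(x) - f^*)^2 + 2c_g^2\|Ax - A\bar{x}\|^2$$
$$\le \left(2M + \frac{4c_g^2}{\sigma_g}\right)(f(x) - f^*) \quad \forall x \in X, \; f(x) - f^* \le M.$$

Finally, we conclude that:

$$\|x - \bar{x}\|^2 \le \frac{2\theta^2(A,c,C)}{\sigma_g}\left(1 + M\sigma_g + 2c_g^2\right)(f(x) - f^*) \quad \forall x \in X, \; f(x) - f^* \le M.$$

This proves the statement of the theorem. □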

Typically, for feasible descent methods we take $M = f(x^0) - f^*$ in the previous
theorem, where $x^0$ is the starting point of the method. Moreover, if $X$ is
bounded, then there always exists $M$ such that $f(x) - f^* \le M$ for all $x \in X$.
Note that the requirement $f(x) - f^* \le M$ for having a second order growth
inequality (22) for $f$ is necessary, as shown in the following example:

Example 1 Let us consider problem (P) in the form (42) given by:

$$\min_{x \in \mathbb{R}^2_+} \; \frac{1}{2}x_1^2 + x_2,$$

which has $X^* = \{0\}$ and $f^* = 0$. Clearly, there is no constant $\kappa_f > 0$ such
that the following inequality is valid:

$$f(x) \ge \frac{\kappa_f}{2}\|x\|^2 \quad \forall x \ge 0.$$

We can take for example $x_1 = 0$ and $x_2 \to +\infty$. However, for any $M > 0$
there exists $\kappa_f(M) > 0$ satisfying the above inequality for all $x \ge 0$ with
$f(x) \le M$. For example, we can take:

$$\kappa_f(M) = \min\left\{1, \frac{1}{M}\right\} \;\; \Rightarrow \;\; \mu_f(M) = \frac{1}{M} \;\; \text{for} \;\; M \ge 1.$$

Note that for this example 0{A, c, C) = = 1. □ 
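The role of the bound $M$ in Example 1 can be checked numerically. The following is a minimal sketch, not part of the original analysis: the helper names `f`, `half_sq_norm`, `kappa_of_M` and `growth_holds` are ours, and the constant $\min\{1, 1/M\}$ is taken as an assumed valid choice (it can be verified by hand, since $x_2 \le M$ on the sublevel set implies $x_2^2/M \le x_2$).

```python
def f(x1, x2):
    """Objective of Example 1: f(x) = 0.5*x1^2 + x2 on x >= 0 (f* = 0, X* = {0})."""
    return 0.5 * x1 * x1 + x2


def half_sq_norm(x1, x2):
    """0.5 * ||x||^2, the quantity multiplied by kappa_f in the growth inequality."""
    return 0.5 * (x1 * x1 + x2 * x2)


def kappa_of_M(M):
    """Assumed growth constant on the sublevel set {x >= 0 : f(x) <= M}."""
    return min(1.0, 1.0 / M)


# Without a bound on f, the ratio f(x) / (0.5*||x||^2) tends to 0 along x = (0, t),
# so no positive kappa_f can work on the whole feasible set.
ratios = [f(0.0, t) / half_sq_norm(0.0, t) for t in (1.0, 1e2, 1e4, 1e6)]


def growth_holds(M, steps=60):
    """Check f(x) >= kappa(M)/2 * ||x||^2 on a grid covering the sublevel set f(x) <= M."""
    kappa = kappa_of_M(M)
    top = max(M, (2.0 * M) ** 0.5)  # x1 <= sqrt(2M) and x2 <= M on the sublevel set
    for i in range(steps + 1):
        for j in range(steps + 1):
            x1, x2 = top * i / steps, top * j / steps
            if f(x1, x2) <= M and f(x1, x2) < kappa * half_sq_norm(x1, x2) - 1e-12:
                return False
    return True
```

The list `ratios` decreases like $2/t$, while `growth_holds(M)` stays true for each fixed $M$; the grid check is only a sanity test, not a proof.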

In the sequel we analyze the convergence rate of several first order methods 
for solving convex constrained optimization problem (P) having the objective 
function in one of the functional classes introduced in this paper. 


5 Linear convergence of first order methods 

We show in the next sections that a broad class of first order methods, covering
important particular algorithms, such as projected gradient, fast gradient,
random/cyclic coordinate descent, extragradient descent and matrix splitting,
have linear convergence rates on optimization problems (P), whose objective
function satisfies one of the non-strongly convex conditions given above.


5.1 Projected gradient method (GM) 

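In this section we consider the projected gradient algorithm with variable step
size:

Algorithm (GM)

Given $x^0 \in X$, for $k \ge 0$ do:

1. Compute $x^{k+1} = \left[x^k - \alpha_k \nabla f(x^k)\right]_X$

where $\alpha_k$ is a step size such that $\alpha_k \in [\bar{L}_f^{-1}, L_f^{-1}]$, with $\bar{L}_f \ge L_f$.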

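5.1.1 Linear convergence of (GM) for $q\mathcal{S}_{L_f,\kappa_f}$

Let us show that the projected gradient method converges linearly on optimization
problems (P) whose objective functions belong to the class $q\mathcal{S}_{L_f,\kappa_f}$.

Theorem 11 Let the optimization problem (P) have the objective function
belonging to the class $q\mathcal{S}_{L_f,\kappa_f}$. Then, the sequence $x^k$ generated by the projected
gradient method (GM) with constant step size $\alpha_k = 1/L_f$ on (P) converges
linearly to some optimal point in $X^*$ with the rate:

$$\|x^k - \bar{x}^k\|^2 \le \left(\frac{1 - \mu_f}{1 + \mu_f}\right)^k \|x^0 - \bar{x}^0\|^2, \quad \text{where} \;\; \mu_f = \frac{\kappa_f}{L_f}.$$ (45)

Proof From the Lipschitz continuity of the gradient of $f$ we have:

$$f(x^{k+1}) \le f(x^k) + \langle \nabla f(x^k), x^{k+1} - x^k\rangle + \frac{L_f}{2}\|x^{k+1} - x^k\|^2.$$ (46)

The optimality conditions for $x^{k+1}$ are:

$$\langle x^{k+1} - x^k + \alpha_k\nabla f(x^k), \; x - x^{k+1}\rangle \ge 0 \quad \forall x \in X.$$ (47)

Taking $x = x^k$ in (47) and replacing the corresponding expression in (46),
we get:

$$f(x^{k+1}) \le f(x^k) + \left(\frac{L_f}{2} - \frac{1}{\alpha_k}\right)\|x^{k+1} - x^k\|^2 \overset{\alpha_k \le L_f^{-1}}{\le} f(x^k) - \frac{L_f}{2}\|x^{k+1} - x^k\|^2.$$

Further, we have:

$$\|x^{k+1} - \bar{x}^k\|^2 = \|x^k - \bar{x}^k\|^2 + 2\langle x^k - \bar{x}^k, x^{k+1} - x^k\rangle + \|x^{k+1} - x^k\|^2$$
$$= \|x^k - \bar{x}^k\|^2 + 2\langle x^{k+1} - \bar{x}^k, x^{k+1} - x^k\rangle - \|x^{k+1} - x^k\|^2$$
$$\overset{(47)}{\le} \|x^k - \bar{x}^k\|^2 + 2\alpha_k\langle \nabla f(x^k), \bar{x}^k - x^{k+1}\rangle - \|x^{k+1} - x^k\|^2$$
$$= \|x^k - \bar{x}^k\|^2 + 2\alpha_k\langle \nabla f(x^k), \bar{x}^k - x^k\rangle + 2\alpha_k\langle \nabla f(x^k), x^k - x^{k+1}\rangle - \|x^{k+1} - x^k\|^2$$
$$\overset{(10)}{\le} \|x^k - \bar{x}^k\|^2 + 2\alpha_k\left(f^* - f(x^k) - \frac{\kappa_f}{2}\|x^k - \bar{x}^k\|^2\right) - 2\alpha_k\left(\langle \nabla f(x^k), x^{k+1} - x^k\rangle + \frac{1}{2\alpha_k}\|x^{k+1} - x^k\|^2\right)$$
$$= (1 - \alpha_k\kappa_f)\|x^k - \bar{x}^k\|^2 + 2\alpha_k f^* - 2\alpha_k\left(f(x^k) + \langle \nabla f(x^k), x^{k+1} - x^k\rangle + \frac{1}{2\alpha_k}\|x^{k+1} - x^k\|^2\right)$$
$$\overset{L_f \le 1/\alpha_k}{\le} (1 - \alpha_k\kappa_f)\|x^k - \bar{x}^k\|^2 + 2\alpha_k f^* - 2\alpha_k\left(f(x^k) + \langle \nabla f(x^k), x^{k+1} - x^k\rangle + \frac{L_f}{2}\|x^{k+1} - x^k\|^2\right)$$
$$\overset{(46)}{\le} (1 - \alpha_k\kappa_f)\|x^k - \bar{x}^k\|^2 - 2\alpha_k\left(f(x^{k+1}) - f^*\right).$$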

Since (10) holds for the function $f$, then from the relations between the functional
classes established earlier we also have that (22) holds and therefore
$f(x^{k+1}) - f^* \ge \frac{\kappa_f}{2}\|x^{k+1} - \bar{x}^{k+1}\|^2$. Combining the last
inequality with the previous one and taking into account that
$\|x^{k+1} - \bar{x}^{k+1}\| \le \|x^{k+1} - \bar{x}^k\|$, we get:

$$\|x^{k+1} - \bar{x}^{k+1}\|^2 \le (1 - \alpha_k\kappa_f)\|x^k - \bar{x}^k\|^2 - \alpha_k\kappa_f\|x^{k+1} - \bar{x}^{k+1}\|^2,$$

or equivalently

$$\|x^{k+1} - \bar{x}^{k+1}\|^2 \le \frac{1 - \alpha_k\kappa_f}{1 + \alpha_k\kappa_f}\|x^k - \bar{x}^k\|^2.$$ (48)

However, the best decrease is obtained for the constant step size $\alpha_k = 1/L_f$
and using the definition of the condition number $\mu_f = \kappa_f/L_f$, we get:

$$\|x^{k+1} - \bar{x}^{k+1}\|^2 \le \frac{1 - \mu_f}{1 + \mu_f}\|x^k - \bar{x}^k\|^2.$$

This proves our statement. □
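The contraction in (45) can be observed on a small instance. The sketch below is an illustration under our own assumptions, not the authors' experiment: it uses $f(x) = \frac{1}{2}\|Ax - b\|^2$ with the hypothetical choice $A = \mathrm{diag}(2, 1, 0)$ and $X = \mathbb{R}^3$, so that the projection onto $X$ is the identity, $L_f = 4$, and $\kappa_f = 1$ is the squared smallest nonzero singular value of $A$, as in (40).

```python
def grad(x, b):
    """Gradient of f(x) = 0.5*((2*x[0]-b[0])**2 + (x[1]-b[1])**2); x[2] is a free direction."""
    return [2.0 * (2.0 * x[0] - b[0]), x[1] - b[1], 0.0]


def project_on_optimal_set(x, b):
    """Projection onto X* = {x : 2*x[0] = b[0], x[1] = b[1]}; x[2] is kept."""
    return [b[0] / 2.0, b[1], x[2]]


def dist_sq(x, b):
    """Squared distance from x to the optimal set X*."""
    xbar = project_on_optimal_set(x, b)
    return sum((xi - yi) ** 2 for xi, yi in zip(x, xbar))


def gradient_method(x0, b, L_f, iters):
    """Algorithm (GM) with alpha_k = 1/L_f; X = R^3, so no projection step is needed."""
    x, history = list(x0), [dist_sq(x0, b)]
    for _ in range(iters):
        g = grad(x, b)
        x = [xi - gi / L_f for xi, gi in zip(x, g)]
        history.append(dist_sq(x, b))
    return x, history


L_f, kappa_f = 4.0, 1.0            # ||A||^2 and sigma_min(A)^2 for A = diag(2, 1, 0)
mu_f = kappa_f / L_f
rho = (1.0 - mu_f) / (1.0 + mu_f)  # contraction factor appearing in (45)
x_final, dists = gradient_method([3.0, -2.0, 5.0], [1.0, 1.0], L_f, 20)
```

The objective is not strongly convex (the third coordinate never moves), yet each entry of `dists` is at most `rho` times the previous one.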

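Based on Theorem 11 we can easily derive linear convergence for the projected
gradient algorithm (GM) in terms of the function values:

$$f(x^{k+1}) \le \min_{x \in X} \; f(x^k) + \langle \nabla f(x^k), x - x^k\rangle + \frac{1}{2\alpha_k}\|x^k - x\|^2$$
$$\le \min_{x \in X} \; f(x) + \frac{1}{2\alpha_k}\|x^k - x\|^2 \le f(\bar{x}^k) + \frac{1}{2\alpha_k}\|x^k - \bar{x}^k\|^2$$
$$\overset{(48)}{\le} f^* + \frac{1}{2\alpha_k}\left(\frac{1 - \alpha_k\kappa_f}{1 + \alpha_k\kappa_f}\right)^k\|x^0 - \bar{x}^0\|^2.$$

Finally, the best convergence rate is obtained for constant step size $\alpha_k = 1/L_f$:

$$f(x^k) - f^* \le \frac{L_f\|x^0 - \bar{x}^0\|^2}{2}\left(\frac{1 - \mu_f}{1 + \mu_f}\right)^{k-1} \quad \forall k \ge 1.$$ (49)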

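However, this rate is not continuous as $\mu_f \to 0$. For simplicity, let us assume
constant step size $\alpha_k = 1/L_f$, and then, using that (GM) is a descent method,
i.e. $f(x^k) - f^* \le f(x^{k-j}) - f^*$ for all $j \le k$, and iterating the main inequality
from the proof of Theorem 11, we obtain:

$$\|x^k - \bar{x}^k\|^2 \le (1 - \mu_f)\|x^{k-1} - \bar{x}^{k-1}\|^2 - \frac{2}{L_f}\left(f(x^k) - f^*\right)$$
$$\le (1 - \mu_f)^k\|x^0 - \bar{x}^0\|^2 - \frac{2}{L_f}\sum_{j=0}^{k-1}(1 - \mu_f)^j\left(f(x^{k-j}) - f^*\right)$$
$$\le (1 - \mu_f)^k\|x^0 - \bar{x}^0\|^2 - \frac{2}{L_f}\left(f(x^k) - f^*\right)\sum_{j=0}^{k-1}(1 - \mu_f)^j.$$

Finally, we get linear convergence in terms of the function values:

$$f(x^k) - f^* \le \frac{L_f\|x^0 - \bar{x}^0\|^2}{2} \cdot \frac{\mu_f}{(1 - \mu_f)^{-k} - 1}.$$ (50)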
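The behaviour of the factor $\mu_f / ((1-\mu_f)^{-k} - 1)$ in (50) for small $\mu_f$ can be inspected numerically. The sketch below is ours (the name `gm_factor` is hypothetical); it uses `math.expm1` and `math.log1p` only to avoid cancellation when $\mu_f$ is tiny.

```python
import math


def gm_factor(mu_f, k):
    """Factor mu_f / ((1 - mu_f)**(-k) - 1) multiplying L_f*||x0 - xbar0||^2 / 2 in (50)."""
    # (1 - mu)^(-k) - 1 = expm1(-k * log1p(-mu)), evaluated without cancellation.
    return mu_f / math.expm1(-k * math.log1p(-mu_f))


k = 10
factors = {mu: gm_factor(mu, k) for mu in (0.5, 1e-1, 1e-3, 1e-6, 1e-9)}
```

For fixed `k` the values in `factors` increase towards `1/k` as `mu` decreases, so the bound (50) degrades continuously to a sublinear one rather than blowing up.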


Since $(1 + \alpha)^k \approx 1 + \alpha k$ as $\alpha \to 0$, then we see that:

$$\frac{\mu_f}{(1 - \mu_f)^{-k} - 1} \to \frac{1}{k} \quad \text{as} \;\; \mu_f \to 0,$$

and thus from (50) we recover the classical sublinear rate for (GM) as $\mu_f \to 0$:

$$f(x^k) - f^* \le \frac{L_f\|x^0 - \bar{x}^0\|^2}{2k} \quad \text{as} \;\; \mu_f \to 0.$$


5.1.2 Linear convergence of (GM) for $\mathcal{F}_{L_f,\kappa_f}$

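We now show that the projected gradient method converges linearly on optimization
problems (P) whose objective functions belong to the class $\mathcal{F}_{L_f,\kappa_f}$.

Theorem 12 Let optimization problem (P) have objective function belonging
to the class $\mathcal{F}_{L_f,\kappa_f}$. Then, the sequence $x^k$ generated by the projected gradient
method (GM) with constant step size $\alpha_k = 1/L_f$ on (P) converges linearly to
some optimal point in $X^*$ with the rate:

$$\|x^k - \bar{x}^k\|^2 \le \left(\frac{1}{1 + \mu_f}\right)^k\|x^0 - \bar{x}^0\|^2, \quad \text{where} \;\; \mu_f = \frac{\kappa_f}{L_f}.$$ (51)

Proof Using similar arguments as in the previous Theorem 11 we have:

$$\|x^{k+1} - x\|^2 = \|x^k - x\|^2 + 2\langle x^k - x, x^{k+1} - x^k\rangle + \|x^{k+1} - x^k\|^2$$
$$= \|x^k - x\|^2 + 2\langle x^{k+1} - x, x^{k+1} - x^k\rangle - \|x^{k+1} - x^k\|^2$$
$$\overset{(47)}{\le} \|x^k - x\|^2 + 2\alpha_k\langle \nabla f(x^k), x - x^{k+1}\rangle - \|x^{k+1} - x^k\|^2$$
$$= \|x^k - x\|^2 - 2\alpha_k\left(\langle \nabla f(x^k), x^{k+1} - x\rangle + \frac{1}{2\alpha_k}\|x^{k+1} - x^k\|^2\right)$$
$$= \|x^k - x\|^2 + (L_f\alpha_k - 1)\|x^{k+1} - x^k\|^2 - 2\alpha_k\left(\langle \nabla f(x^k), x^k - x\rangle + \langle \nabla f(x^k), x^{k+1} - x^k\rangle + \frac{L_f}{2}\|x^{k+1} - x^k\|^2\right)$$
$$\overset{(46)}{\le} \|x^k - x\|^2 + (L_f\alpha_k - 1)\|x^{k+1} - x^k\|^2 + 2\alpha_k\left(f(x) - f(x^k)\right) + 2\alpha_k\left(f(x^k) - f(x^{k+1})\right)$$
$$\overset{\alpha_k \le L_f^{-1}}{\le} \|x^k - x\|^2 - 2\alpha_k\left(f(x^{k+1}) - f(x)\right) \quad \forall x \in X.$$

Taking now in the previous relations $x = \bar{x}^k$, using $\|x^{k+1} - \bar{x}^{k+1}\| \le \|x^{k+1} - \bar{x}^k\|$
and the quadratic functional growth of $f$ (22), we get:

$$\|x^{k+1} - \bar{x}^{k+1}\|^2 \le \|x^k - \bar{x}^k\|^2 - \kappa_f\alpha_k\|x^{k+1} - \bar{x}^{k+1}\|^2,$$

or equivalently

$$\|x^{k+1} - \bar{x}^{k+1}\|^2 \le \frac{1}{1 + \kappa_f\alpha_k}\|x^k - \bar{x}^k\|^2.$$ (52)

However, the best decrease is obtained for the constant step size $\alpha_k = 1/L_f$
and using the definition of the condition number $\mu_f = \kappa_f/L_f$, we get:

$$\|x^{k+1} - \bar{x}^{k+1}\|^2 \le \frac{1}{1 + \mu_f}\|x^k - \bar{x}^k\|^2.$$

Thus, we have obtained the linear convergence rate for (GM) with constant
step size $\alpha_k = 1/L_f$ from the theorem. □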

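Using similar arguments as for (49) and combining with (52) we can also derive
linear convergence of (GM) in terms of the function values:

$$f(x^{k+1}) - f^* \le \frac{1}{2\alpha_k}\|x^k - \bar{x}^k\|^2 \overset{(52)}{\le} \frac{1}{2\alpha_k}\left(\frac{1}{1 + \kappa_f\alpha_k}\right)^k\|x^0 - \bar{x}^0\|^2,$$

and the best convergence rate is obtained for constant step size $\alpha_k = 1/L_f$:

$$f(x^k) - f^* \le \frac{L_f\|x^0 - \bar{x}^0\|^2}{2}\left(\frac{1}{1 + \mu_f}\right)^{k-1} \quad \forall k \ge 1.$$ (53)

However, this rate is not continuous as $\mu_f \to 0$. We can interpolate between
the right hand side terms in the classical sublinear rate of (GM) and in (52) to
obtain convergence rates in terms of function values of the form:

$$f(x^k) - f^* \le \frac{L_f\|x^t - \bar{x}^t\|^2}{2(k - t)} \le \frac{L_f\|x^0 - \bar{x}^0\|^2}{2(k - t)} \cdot \frac{1}{(1 + \mu_f)^t} \quad \forall t = 0 : k - 1,$$

or equivalently

$$f(x^k) - f^* \le \min_{t = 0 : k-1} \frac{L_f\|x^0 - \bar{x}^0\|^2}{2(1 + \mu_f)^t (k - t)}.$$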

Finally, in the next theorem we establish necessary and sufficient conditions
for linear convergence of the gradient method (GM).

Theorem 13 On the class of optimization problems (P) the sequence generated
by the gradient method (GM) with constant step size converges linearly
to some optimal point in $X^*$ if and only if the objective function $f$ satisfies the
quadratic functional growth (22), i.e. $f$ belongs to the functional class $\mathcal{F}_{L_f,\kappa_f}$.

Proof The fact that linear convergence of the gradient method implies that $f$
satisfies the second order growth property (22) follows from a theorem established
earlier in the paper. The other implication follows from Theorem 12, eq. (52). □


5.2 Fast gradient method (FGM) 

In this section we consider the following fast gradient algorithm, which is a 
version of Nesterov’s optimal gradient method m- 


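Algorithm (FGM)

Given $x^0 = y^0 \in X$, for $k \ge 0$ do:

1. Compute $x^{k+1} = \left[y^k - \frac{1}{L_f}\nabla f(y^k)\right]_X$ and

2. $y^{k+1} = x^{k+1} + \beta_k\left(x^{k+1} - x^k\right)$

for appropriate choice of the parameter $\beta_k > 0$ for all $k \ge 0$.


5.2.1 Linear convergence of (FGM) for $q\mathcal{S}_{L_f,\kappa_f}$

When the objective function $f \in q\mathcal{S}_{L_f,\kappa_f}$ we take the following expression
for the parameter $\beta_k$:

$$\beta_k = \frac{\sqrt{L_f} - \sqrt{\kappa_f}}{\sqrt{L_f} + \sqrt{\kappa_f}} \quad \forall k \ge 0.$$

First of all we can easily observe that if $f \in q\mathcal{S}_{L_f,\kappa_f}(X)$, then the gradient
mapping $g(x)$ satisfies the following inequality:

$$f^* \ge f(x^+) + \langle g(x), \bar{x} - x\rangle + \frac{1}{2L_f}\|g(x)\|^2 + \frac{\kappa_f}{2}\|\bar{x} - x\|^2 =: q_{L_f,\kappa_f}(\bar{x}, x)$$ (54)

for all $x \in \mathbb{R}^n$ (recall that $\bar{x} = [x]_{X^*}$ and $x^+ = [x - \frac{1}{L_f}\nabla f(x)]_X$). The
convergence proof follows similar steps as in [11] [Section 2.2.4].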

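Lemma 1 Let optimization problem (P) have the objective function $f$ belonging
to the class $q\mathcal{S}_{L_f,\kappa_f}$ and an arbitrary sequence $\{y^k\}_{k \ge 0}$ satisfying
$\bar{y}^k = [y^k]_{X^*} = y^*$ for all $k \ge 0$. Define an initial function:

$$\phi_0(x) = \phi_0^* + \frac{\gamma_0}{2}\|x - v^0\|^2, \quad \text{where} \;\; \gamma_0 = \kappa_f, \; v^0 = y^0 \;\; \text{and} \;\; \phi_0^* = f(y^0),$$

and a sequence $\{\alpha_k\}_{k \ge 0}$ satisfying $\alpha_k \in (0, 1)$. Then, the following two
sequences, iteratively defined as:

$$\lambda_{k+1} = (1 - \alpha_k)\lambda_k, \quad \text{with} \;\; \lambda_0 = 1,$$
$$\phi_{k+1}(x) = (1 - \alpha_k)\phi_k(x) + \alpha_k\left(f(x^{k+1}) + \frac{1}{2L_f}\|g(y^k)\|^2 + \langle g(y^k), x - y^k\rangle + \frac{\kappa_f}{2}\|x - y^k\|^2\right),$$ (55)

where $x^0 = y^0$ and $x^{k+1} = \left[y^k - \frac{1}{L_f}\nabla f(y^k)\right]_X$, satisfy the following property:

$$\phi_k(y^*) \le (1 - \lambda_k)f^* + \lambda_k\phi_0(y^*) \quad \forall k \ge 0.$$ (56)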

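Proof We prove this statement by induction. Since $\lambda_0 = 1$, we observe that:

$$\phi_0(y^*) = (1 - \lambda_0)f^* + \lambda_0\phi_0(y^*).$$

Assume that the following inequality is valid:

$$\phi_k(y^*) \le (1 - \lambda_k)f^* + \lambda_k\phi_0(y^*),$$ (57)

then we have:

$$\phi_{k+1}(y^*) = \phi_{k+1}(\bar{y}^k) = (1 - \alpha_k)\phi_k(\bar{y}^k) + \alpha_k \, q_{L_f,\kappa_f}(\bar{y}^k, y^k)$$
$$\overset{(54)}{\le} (1 - \alpha_k)\phi_k(\bar{y}^k) + \alpha_k f^*$$
$$= [1 - (1 - \alpha_k)\lambda_k]f^* + (1 - \alpha_k)\left(\phi_k(\bar{y}^k) - (1 - \lambda_k)f^*\right)$$
$$\overset{\bar{y}^k = y^* \, + \, (57)}{\le} (1 - \lambda_{k+1})f^* + \lambda_{k+1}\phi_0(y^*),$$

which proves our statement. □

Lemma 2 Under the same assumptions as in Lemma 1 and assuming also
that the sequence $\{x^k\}_{k \ge 0}$, defined as $x^0 = y^0$ and $x^{k+1} = \left[y^k - \frac{1}{L_f}\nabla f(y^k)\right]_X$,
satisfies:

$$f(x^k) \le \phi_k^* = \min_{x \in \mathbb{R}^n} \phi_k(x) \quad \forall k \ge 0,$$ (58)

then we obtain the following convergence:

$$f(x^k) - f^* \le \lambda_k\left(f(x^0) - f^* + \frac{\gamma_0}{2}\|y^* - y^0\|^2\right).$$ (59)

Proof Indeed we have:

$$f(x^k) - f^* \le \phi_k^* - f^* = \min_{x \in \mathbb{R}^n}\phi_k(x) - f^* \le \phi_k(y^*) - f^*$$
$$\overset{(56)}{\le} (1 - \lambda_k)f^* + \lambda_k\phi_0(y^*) - f^* = \lambda_k\left(\phi_0(y^*) - f^*\right),$$

which proves the statement of the lemma. □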

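Theorem 14 Under the same assumptions as in Lemma 1, the sequence $x^k$
generated by fast gradient method (FGM) with constant parameter
$\beta_k = (\sqrt{L_f} - \sqrt{\kappa_f})/(\sqrt{L_f} + \sqrt{\kappa_f})$ converges linearly in terms of function values
with the rate:

$$f(x^k) - f^* \le \left(1 - \sqrt{\mu_f}\right)^k \cdot 2\left(f(x^0) - f^*\right), \quad \text{where} \;\; \mu_f = \frac{\kappa_f}{L_f},$$ (60)

provided that all iterates $y^k$ produce the same projection¹ onto optimal set $X^*$.

¹ See Remark 1 below for an example satisfying this condition.

Proof Let us consider $x^0 = y^0 = v^0 \in X$. Further, for the sequence of functions
$\phi_k(x)$ as defined in (55) take $\alpha_k = \sqrt{\mu_f} \in (0, 1)$ for all $k \ge 0$ and denote
$\alpha = \sqrt{\mu_f}$. First, we need to show that the method (FGM) defined above
generates a sequence $x^k$ satisfying $\phi_k^* \ge f(x^k)$. Assuming that $\phi_k(x)$ has the
following two properties:

$$\phi_k(x) = \phi_k^* + \frac{\kappa_f}{2}\|x - v^k\|^2 \quad \text{and} \quad \phi_k^* \ge f(x^k),$$

where $\phi_k^* = \min_{x \in \mathbb{R}^n}\phi_k(x)$ and $v^k = \arg\min_{x \in \mathbb{R}^n}\phi_k(x)$, then we will show
that $\phi_{k+1}(x)$ has similar properties. First of all, from the definition of $\phi_{k+1}(x)$,
we get:

$$\nabla^2\phi_{k+1}(x) = \left((1 - \alpha)\kappa_f + \alpha\kappa_f\right)I_n = \kappa_f I_n,$$

i.e. $\phi_{k+1}(x)$ is also a quadratic function of the same form as $\phi_k(x)$:

$$\phi_{k+1}(x) = \phi_{k+1}^* + \frac{\kappa_f}{2}\|x - v^{k+1}\|^2,$$

where the expression of $v^{k+1} = \arg\min_{x \in \mathbb{R}^n}\phi_{k+1}(x)$ is obtained from the
equation $\nabla\phi_{k+1}(x) = 0$, which leads to:

$$v^{k+1} = \frac{1}{\kappa_f}\left((1 - \alpha)\kappa_f v^k + \alpha\kappa_f y^k - \alpha g(y^k)\right).$$

Evaluating $\phi_{k+1}$ in $y^k$ leads to:

$$\phi_{k+1}^* + \frac{\kappa_f}{2}\|y^k - v^{k+1}\|^2 = (1 - \alpha)\left(\phi_k^* + \frac{\kappa_f}{2}\|y^k - v^k\|^2\right) + \alpha\left(f(x^{k+1}) + \frac{1}{2L_f}\|g(y^k)\|^2\right).$$

On the other hand, we have:

$$v^{k+1} - y^k = \frac{1}{\kappa_f}\left(\kappa_f(1 - \alpha)(v^k - y^k) - \alpha g(y^k)\right).$$

If we substitute this expression above, we obtain:

$$\phi_{k+1}^* = (1 - \alpha)\phi_k^* + \alpha f(x^{k+1}) + \left(\frac{\alpha}{2L_f} - \frac{\alpha^2}{2\kappa_f}\right)\|g(y^k)\|^2 + \alpha(1 - \alpha)\left(\frac{\kappa_f}{2}\|y^k - v^k\|^2 + \langle g(y^k), v^k - y^k\rangle\right).$$

Using the main property of the gradient mapping, valid for functions with
Lipschitz continuous gradient, we have:

$$\phi_k^* \ge f(x^k) \ge f(x^{k+1}) + \langle g(y^k), x^k - y^k\rangle + \frac{1}{2L_f}\|g(y^k)\|^2.$$

Substituting this inequality in the previous one we get:

$$\phi_{k+1}^* \ge f(x^{k+1}) + \left(\frac{1}{2L_f} - \frac{\alpha^2}{2\kappa_f}\right)\|g(y^k)\|^2 + (1 - \alpha)\langle g(y^k), \alpha(v^k - y^k) + x^k - y^k\rangle.$$

Since $\alpha = \sqrt{\kappa_f/L_f}$, then $\frac{1}{2L_f} - \frac{\alpha^2}{2\kappa_f} = 0$. Moreover, we have the freedom to choose
$y^k$, which is obtained from the condition $\alpha(v^k - y^k) + x^k - y^k = 0$:

$$y^k = \frac{1}{1 + \alpha}\left(\alpha v^k + x^k\right).$$

Then, we can conclude that $\phi_{k+1}^* \ge f(x^{k+1})$. Moreover, replacing the expression
of $y^k$ in $v^{k+1}$ leads to the conclusion that we can eliminate the sequence
$v^k$ since it can be expressed as: $v^{k+1} = x^k + \frac{1}{\alpha}(x^{k+1} - x^k)$. Then,
we find that $y^{k+1}$ has the expression as in our scheme (FGM) above with
$\beta_k = (\sqrt{L_f} - \sqrt{\kappa_f})/(\sqrt{L_f} + \sqrt{\kappa_f})$. Using now Lemmas 1 and 2 we get the
convergence rate from (60) (we also use that $\frac{\kappa_f}{2}\|x^0 - \bar{x}^0\|^2 \le f(x^0) - f^*$). □

Remark 1 For the unconstrained problem $\min_{x \in \mathbb{R}^n} g(Ax)$ the gradient in some
point $y$ is given by $A^T\nabla g(Ay) \in \mathrm{Range}(A^T)$. Then, the method (FGM)
generates in this case a sequence $y^k$ of the form:

$$y^k = y^0 + A^T z^k, \quad z^k \in \mathbb{R}^m \quad \forall k \ge 0.$$

Moreover, for this problem the optimal set is $X^* = \{x : Ax = t^*\}$ and the
projection onto this affine subspace is given by:

$$[\,\cdot\,]_{X^*} = \left(I_n - A^T(AA^T)^{-1}A\right)(\cdot) + A^T(AA^T)^{-1}t^*.$$

In conclusion, all vectors $y^k$ generated by algorithm (FGM) produce the same
projection onto the optimal set $X^*$:

$$\bar{y}^k = y^0 - A^T(AA^T)^{-1}Ay^0 + A^T(AA^T)^{-1}t^* \quad \forall k \ge 0,$$

i.e. the assumptions of Theorem 14 are valid for this optimization problem. □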
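The rate (60) can be checked on an unconstrained instance of the type covered by Remark 1. The sketch below is only an illustration under our own assumptions: the test function $f(x) = \frac{1}{2}\|Ax - b\|^2$ with the hypothetical $A = \mathrm{diag}(2, 1, 0)$ has $L_f = 4$, $\kappa_f = 1$, $f^* = 0$, and $X = \mathbb{R}^3$, so no projection onto $X$ is needed.

```python
import math


def f_val(x, b):
    """f(x) = 0.5*((2*x[0]-b[0])**2 + (x[1]-b[1])**2); f* = 0 and x[2] is free."""
    return 0.5 * ((2.0 * x[0] - b[0]) ** 2 + (x[1] - b[1]) ** 2)


def grad(x, b):
    return [2.0 * (2.0 * x[0] - b[0]), x[1] - b[1], 0.0]


def fast_gradient(x0, b, L_f, kappa_f, iters):
    """(FGM) with constant beta = (sqrt(L_f) - sqrt(kappa_f)) / (sqrt(L_f) + sqrt(kappa_f))."""
    beta = (math.sqrt(L_f) - math.sqrt(kappa_f)) / (math.sqrt(L_f) + math.sqrt(kappa_f))
    x, y = list(x0), list(x0)
    gaps = [f_val(x, b)]  # f(x^k) - f*, with f* = 0 for this test function
    for _ in range(iters):
        g = grad(y, b)
        x_new = [yi - gi / L_f for yi, gi in zip(y, g)]           # gradient step from y^k
        y = [xn + beta * (xn - xo) for xn, xo in zip(x_new, x)]   # extrapolation step
        x = x_new
        gaps.append(f_val(x, b))
    return x, gaps


L_f, kappa_f = 4.0, 1.0
mu_f = kappa_f / L_f
x_final, gaps = fast_gradient([3.0, -2.0, 5.0], [1.0, 1.0], L_f, kappa_f, 30)
bounds = [2.0 * (1.0 - math.sqrt(mu_f)) ** k * gaps[0] for k in range(31)]
```

Every entry of `gaps` stays below the corresponding entry of `bounds`, and the coordinate along the null space of $A$ is never changed, which is the mechanism behind Remark 1.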


5.2.2 Linear convergence of restart (FGM) for $\mathcal{F}_{L_f,\kappa_f}$


It is known that for the convex optimization problem (P), whose objective
function $f$ has Lipschitz continuous gradient, and for the choice:

$$\beta_k = \frac{\theta_k - 1}{\theta_{k+1}}, \quad \text{with} \;\; \theta_1 = 1 \;\; \text{and} \;\; \theta_{k+1} = \frac{1 + \sqrt{1 + 4\theta_k^2}}{2},$$

the algorithm (FGM) has the following convergence rate [11][13]:

$$f(x^k) - f^* \le \frac{2L_f\|x^0 - \bar{x}^0\|^2}{(k + 1)^2} \quad \forall k > 0.$$ (61)

We will show next that on the optimization problem (P) whose objective
function satisfies additionally the quadratic functional growth (22), i.e.
$f \in \mathcal{F}_{L_f,\kappa_f}$, a restarting version of algorithm (FGM) with the above choice of
$\beta_k$ has linear convergence without the assumption $\bar{y}^k = y^*$ for all $k \ge 0$.
Restarting variants of (FGM) have also been considered in other contexts in
the literature. By fixing a positive constant $c \in (0, 1)$ and then combining (61) and
(22), we get:

$$f(x^k) - f^* \le \frac{2L_f\|x^0 - \bar{x}^0\|^2}{(k + 1)^2} \le \frac{4L_f}{\kappa_f(k + 1)^2}\left(f(x^0) - f^*\right) \le c\left(f(x^0) - f^*\right),$$

which leads to the following expression:

$$c = \frac{4L_f}{\kappa_f k^2}.$$

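Then, for fixed $c$, the number of iterations $K_c$ that we need to perform in order
to obtain $f(x^{K_c}) - f^* \le c\,(f(x^0) - f^*)$ is given by:

$$K_c = \left\lceil \sqrt{\frac{4L_f}{c\,\kappa_f}} \, \right\rceil.$$

Therefore, after each $K_c$ steps of Algorithm (FGM) we restart it obtaining the
following scheme:

Algorithm (R-FGM)

Given $x^{0,0} = y^{0,0} = x^0 \in X$ and restart interval $K_c$. For $j \ge 0$ do:

1. Run Algorithm (FGM) for $K_c$ iterations to get $x^{K_c,j}$

2. Restart: $x^{0,j+1} = x^{K_c,j}$, $y^{0,j+1} = x^{K_c,j}$ and $\theta_1 = 1$.

Then, after $p$ restarts of Algorithm (R-FGM) we obtain the linear convergence:

$$f(x^{0,p}) - f^* = f(x^{K_c,p-1}) - f^* \le \frac{2L_f\|x^{0,p-1} - \bar{x}^{0,p-1}\|^2}{(K_c + 1)^2}$$
$$\le c\left(f(x^{0,p-1}) - f^*\right) \le \cdots \le c^p\left(f(x^{0,0}) - f^*\right) = c^p\left(f(x^0) - f^*\right).$$

Thus, the total number of iterations is $k = pK_c$; denote $x^k = x^{0,p}$. Then, we
have:

$$f(x^k) - f^* \le \left(c^{1/K_c}\right)^k\left(f(x^0) - f^*\right).$$

We want to optimize e.g. the number of iterations $K_c$:

$$\min_{K_c} c^{1/K_c} \;\; \Leftrightarrow \;\; \min_{K_c} \frac{1}{K_c}\log c \;\; \Leftrightarrow \;\; \min_{K_c} \frac{1}{K_c}\log\frac{4}{\mu_f K_c^2},$$

which leads to

$$K_c^* = \frac{2e}{\sqrt{\mu_f}} \quad \text{and} \quad c = e^{-2}.$$

In conclusion, we get the following convergence rate for (R-FGM) method:

$$f(x^k) - f^* \le \left(e^{-2/K_c^*}\right)^k\left(f(x^0) - f^*\right) = \left(e^{-\sqrt{\mu_f}/e}\right)^k\left(f(x^0) - f^*\right),$$ (62)

and since $e^{\alpha} \approx 1 + \alpha$ as $\alpha \approx 0$, then for $\mu_f \approx 0$ we get:

$$f(x^k) - f^* \le \left(e^{-\sqrt{\mu_f}/e}\right)^k\left(f(x^0) - f^*\right) \approx \left(1 - \frac{\sqrt{\mu_f}}{e}\right)^k\left(f(x^0) - f^*\right).$$ (63)

Note that if the optimal value $f^*$ is known in advance, then we just need
to restart algorithm (R-FGM) at the iteration $K_c \le K_c^*$ when the following
condition holds:

$$f(x^{K_c,j}) - f^* \le c\left(f(x^{0,j}) - f^*\right),$$

which can be practically verified. Using the second order growth property (22)
we can also obtain easily linear convergence of the generated sequence $x^k$ to
some optimal point in $X^*$.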
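The restart scheme can be sketched in a few lines. The code below is a hedged illustration, not the authors' implementation: it assumes the standard recursion $\theta_{k+1} = (1 + \sqrt{1 + 4\theta_k^2})/2$ with $\theta_1 = 1$ for the momentum parameter, uses the restart interval $K_c = \lceil 2e/\sqrt{\mu_f}\,\rceil$, and runs on the hypothetical test function $f(x) = \frac{1}{2}\|Ax - b\|^2$ with $A = \mathrm{diag}(2, 1, 0)$, for which $L_f = 4$, $\kappa_f = 1$, $f^* = 0$ and $X = \mathbb{R}^3$.

```python
import math


def f_val(x, b):
    """f(x) = 0.5*((2*x[0]-b[0])**2 + (x[1]-b[1])**2); f* = 0 and x[2] is free."""
    return 0.5 * ((2.0 * x[0] - b[0]) ** 2 + (x[1] - b[1]) ** 2)


def grad(x, b):
    return [2.0 * (2.0 * x[0] - b[0]), x[1] - b[1], 0.0]


def fgm_cycle(x0, b, L_f, iters):
    """One run of (FGM) with theta_1 = 1 and beta_k = (theta_k - 1) / theta_{k+1}."""
    x, y, theta = list(x0), list(x0), 1.0
    for _ in range(iters):
        g = grad(y, b)
        x_new = [yi - gi / L_f for yi, gi in zip(y, g)]
        theta_next = (1.0 + math.sqrt(1.0 + 4.0 * theta * theta)) / 2.0
        beta = (theta - 1.0) / theta_next
        y = [xn + beta * (xn - xo) for xn, xo in zip(x_new, x)]
        x, theta = x_new, theta_next
    return x


def restart_fgm(x0, b, L_f, kappa_f, restarts):
    """(R-FGM): restart (FGM) every K_c = ceil(2e / sqrt(mu_f)) iterations."""
    K_c = math.ceil(2.0 * math.e / math.sqrt(kappa_f / L_f))
    x, gaps = list(x0), [f_val(x0, b)]  # f(x) - f*, with f* = 0 here
    for _ in range(restarts):
        x = fgm_cycle(x, b, L_f, K_c)   # momentum is reset at the start of each cycle
        gaps.append(f_val(x, b))
    return K_c, gaps


K_c, gaps = restart_fgm([3.0, -2.0, 5.0], [1.0, 1.0], 4.0, 1.0, 3)
c = math.exp(-2.0)
```

With $\mu_f = 1/4$ the interval is `K_c = 11`, and each cycle reduces the optimality gap by at least the factor `c`, as predicted by combining (61) with the quadratic functional growth.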










5.3 Feasible descent methods (FDM) 

We now consider a more general descent version of Algorithm (GM) where the
gradients are perturbed:

Algorithm (FDM)

Given $x^0 \in X$ and $\beta, L > 0$, for $k \ge 0$ do:

Compute $x^{k+1} = \left[x^k - \alpha_k\nabla f(x^k) + e^k\right]_X$ such that

$$\|e^k\| \le \beta\|x^{k+1} - x^k\| \quad \text{and} \quad f(x^{k+1}) \le f(x^k) - \frac{L}{2}\|x^{k+1} - x^k\|^2,$$

where the stepsize $\alpha_k$ is chosen such that $\alpha_k \ge \bar{L}_f^{-1} > 0$ for all $k$. It has been
showed in laiiz] that algorithm (FDM) covers important particular schemes: 
e.g. proximal point minimization, random/cyclic coordinate descent, extragra¬ 
dient descent and matrix splitting methods are all feasible descent methods. 
Note that linear convergence of algorithm (FDM) under the error bound as¬ 
sumption (EH), i.e. / G is proved e.g. in [SUIT]. Hence, in the next 

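theorem we prove that the feasible descent method (FDM) converges linearly
in terms of function values on optimization problems (P) whose objective
functions belong to the class $\mathcal{F}_{L_f,\kappa_f}$.

Theorem 15 Let the optimization problem (P) have the objective function
belonging to the class $\mathcal{F}_{L_f,\kappa_f}$. Then, the sequence $x^k$ generated by the feasible
descent method (FDM) on (P) converges linearly in terms of function values
with the rate:

$$f(x^k) - f^* \le \left(\frac{1}{1 + \frac{L\kappa_f}{4(L_f + \bar{L}_f + \beta\bar{L}_f)^2}}\right)^k\left(f(x^0) - f^*\right).$$ (64)

Proof The optimality conditions for computing $x^{k+1}$ are:

$$\langle x^{k+1} - x^k + \alpha_k\nabla f(x^k) - e^k, \; x - x^{k+1}\rangle \ge 0 \quad \forall x \in X.$$ (65)

Then, using convexity of $f$ and the Cauchy-Schwarz inequality, we get:

$$f(x^{k+1}) - f(\bar{x}^{k+1}) \le \langle \nabla f(x^{k+1}), x^{k+1} - \bar{x}^{k+1}\rangle$$
$$= \langle \nabla f(x^{k+1}) - \nabla f(x^k) + \nabla f(x^k), \; x^{k+1} - \bar{x}^{k+1}\rangle$$
$$\overset{(65)}{\le} L_f\|x^{k+1} - x^k\|\,\|x^{k+1} - \bar{x}^{k+1}\| + \frac{1}{\alpha_k}\langle x^{k+1} - x^k - e^k, \; \bar{x}^{k+1} - x^{k+1}\rangle$$
$$\le (L_f + \bar{L}_f)\|x^{k+1} - x^k\|\,\|x^{k+1} - \bar{x}^{k+1}\| + \bar{L}_f\|e^k\|\,\|x^{k+1} - \bar{x}^{k+1}\|$$
$$\le (L_f + \bar{L}_f + \beta\bar{L}_f)\|x^{k+1} - x^k\|\,\|x^{k+1} - \bar{x}^{k+1}\|.$$

Since $f \in \mathcal{F}_{L_f,\kappa_f}$, then it satisfies the second order growth property, i.e.
$f(x^{k+1}) - f(\bar{x}^{k+1}) \ge \frac{\kappa_f}{2}\|x^{k+1} - \bar{x}^{k+1}\|^2$, and using it in the previous derivations
we obtain:

$$f(x^{k+1}) - f(\bar{x}^{k+1}) \le \frac{2(L_f + \bar{L}_f + \beta\bar{L}_f)^2}{\kappa_f}\|x^{k+1} - x^k\|^2.$$ (66)

Combining (66) with the descent property of the algorithm (FDM), i.e.
$\|x^{k+1} - x^k\|^2 \le \frac{2}{L}\left(f(x^k) - f(x^{k+1})\right)$, we get:

$$f(x^{k+1}) - f(\bar{x}^{k+1}) \le \frac{4(L_f + \bar{L}_f + \beta\bar{L}_f)^2}{L\kappa_f}\left(f(x^k) - f(x^{k+1})\right),$$

which leads to

$$f(x^{k+1}) - f(\bar{x}^{k+1}) \le \frac{1}{1 + \frac{L\kappa_f}{4(L_f + \bar{L}_f + \beta\bar{L}_f)^2}}\left(f(x^k) - f(\bar{x}^k)\right).$$

Using an inductive argument we get the statement of the theorem. □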

Note that, once we have obtained linear convergence in terms of function
values for the algorithm (FDM), we can also obtain linear convergence of the
generated sequence $x^k$ to some optimal point in $X^*$ by using the second order
growth property (22).


5.4 Discussions 

From previous sections we can conclude that for some classes of problems 
improved linear convergence rates are obtained as compared to the existing 
results. For example, in [siiiniin] it has been proved that the optimization 
problem whose objective function satisfies the conditions of Theorem [TUI 
has on a compact set an error bound property of the form (I31F In this paper 
we proved that this class of problems has the objective function satisfying
the quadratic functional growth (22). For the class of problems having an
objective function satisfying an error bound condition the feasible descent
method (FDM) is shown to converge linearly in [5 HTU1ITB1IT7] . Note that for 
$\alpha_k = 1/L_f$, $\beta = 0$ and $L = L_f$ we recover from algorithm (FDM) the algorithm
(GM). However, for these choices the linear rate of (GM) derived in Section 5.1.2, given by
$1/(1 + \mu_f)$, is better than the one obtained in Theorem 15, given by $1/(1 + \mu_f/16)$.
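The comparison between the two rates can be made concrete with a short computation. The sketch below is ours (the function names `fdm_rate` and `gm_rate` are hypothetical); it evaluates the factor in (64) for the parameter choice that turns (FDM) into (GM), taking $\bar{L}_f = L_f$ because $\alpha_k = 1/L_f$, and compares it with the factor $1/(1 + \mu_f)$.

```python
def fdm_rate(L, kappa_f, L_f, L_bar, beta):
    """Per-iteration contraction factor of (FDM) in function values, as in (64)."""
    return 1.0 / (1.0 + L * kappa_f / (4.0 * (L_f + L_bar + beta * L_bar) ** 2))


def gm_rate(kappa_f, L_f):
    """Per-iteration factor 1 / (1 + mu_f) of (GM) under quadratic functional growth."""
    return 1.0 / (1.0 + kappa_f / L_f)


L_f, kappa_f = 10.0, 0.1   # illustrative constants, mu_f = 0.01
mu_f = kappa_f / L_f
as_fdm = fdm_rate(L=L_f, kappa_f=kappa_f, L_f=L_f, L_bar=L_f, beta=0.0)
as_gm = gm_rate(kappa_f, L_f)
```

Here `as_fdm` equals $1/(1 + \mu_f/16)$ and is closer to one than `as_gm`, which reflects the extra generality of the (FDM) analysis.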

Recently, in [16] the authors show that the class of convex unconstrained problems
$\min_{x \in \mathbb{R}^n} g(Ax)$, with $g$ strongly convex function having Lipschitz continuous
gradient, satisfies a restricted strong convex inequality, which is a particular
version of our more general quadratic gradient growth inequality.
However, in this paper we proved that the objective function of this particular
class of optimization problems belongs to a more restricted functional class,
namely $q\mathcal{S}_{L_f,\kappa_f}(X)$, i.e. it satisfies (10). Thus, for this class of problems we
provide better linear rates for gradient method and for fast gradient method
as compared to [16]. More precisely, for the gradient method (GM) we derive
convergence rate of order $(1 - \mu_f)/(1 + \mu_f)$, while [16] proved convergence rate
of order $(1 - \mu_f)$. Moreover, to the best of our knowledge, this paper shows for
the first time linear convergence of the usual fast gradient method (FGM) for
this class of convex problems $\min_{x \in \mathbb{R}^n} g(Ax)$, while for example [16] derives
a worse rate of convergence and for a restarting variant of the fast gradient
method (R-FGM).


6 Applications 

In this section we present several applications having the objective function in
one of the structured functional classes of Section 4.


6.1 Solution of linear systems 


It is well known that finding a solution of a symmetric linear system $Qx + q = 0$,
where $Q \succeq 0$ (notation for positive semi-definite matrix), is equivalent to
solving a convex quadratic program (QP):

$$\min_{x \in \mathbb{R}^n} f(x) \quad \Big(= \frac{1}{2}x^TQx + q^Tx\Big).$$

Let $Q = L_Q^TL_Q$ be the Cholesky decomposition of $Q$. For simplicity, let us
assume that our symmetric linear system has a solution, e.g. $x_s$, then $q$ is
in the range of $Q$, i.e. $q = -Qx_s = -L_Q^TL_Qx_s$. Therefore, if we define the
strongly convex function $g(z) = \frac{1}{2}\|z\|^2 - (L_Qx_s)^Tz$, having $L_g = \sigma_g = 1$, then
our objective function is the composition of $g$ with the linear map $L_Qx$:

$$f(x) = \frac{1}{2}\|L_Qx\|^2 - (L_Q^TL_Qx_s)^Tx = g(L_Qx).$$

Thus, our convex quadratic problem is in the form of unconstrained structured
optimization problem (38) and from Section 4 we conclude that the objective
function of this QP is in the class $q\mathcal{S}_{L_f,\kappa_f}$ with:

$$L_f = \lambda_{\max}(Q) \;\; \text{and} \;\; \kappa_f = \sigma_{\min}^2(L_Q) = \lambda_{\min}(Q) \;\; \Rightarrow \;\; \mu_f = \frac{\lambda_{\min}(Q)}{\lambda_{\max}(Q)} \equiv \frac{1}{\mathrm{cond}(Q)},$$

where $\lambda_{\min}(Q)$ denotes the smallest non-zero eigenvalue of $Q$ and $\lambda_{\max}(Q)$ is
the largest eigenvalue of $Q$. Since we assume that our symmetric linear system
has a solution, i.e. $f^* = 0$, from Theorem 14 and Remark 1 we conclude that
when solving this convex QP with the algorithm (FGM) we get the convergence
rate in terms of function values:


$$f(x^k) - f^* \le \left(1 - \frac{1}{\sqrt{\mathrm{cond}(Q)}}\right)^k \cdot 2\left(f(x^0) - f^*\right),$$

or in terms of residual (gradient) or distance to the solution:

$$\|Qx^k + q\|^2 = \|\nabla f(x^k)\|^2 \le L_f^2\|x^k - \bar{x}^k\|^2 \le \frac{2L_f^2}{\kappa_f}\left(f(x^k) - f^*\right).$$

Therefore, the usual (FGM) algorithm without restart attains an $\epsilon$-optimal
solution in a number of iterations of order $\sqrt{\mathrm{cond}(Q)}\log\frac{1}{\epsilon}$, i.e. the condition
number $\mathrm{cond}(Q)$ of the matrix $Q$ is square rooted. To our knowledge, this is
one of the first results showing linear convergence depending on the square root
of the condition number for the fast gradient method on solving a symmetric
linear system with positive semi-definite matrix $Q \succeq 0$. Note that the linear
conjugate gradient method can also attain an $\epsilon$-approximate solution in much
fewer than $n$ steps, i.e. the same $\sqrt{\mathrm{cond}(Q)}\log\frac{1}{\epsilon}$ iterations [1]. Usually, in
the literature the condition number appears linearly in the convergence rate
of first order methods for solving linear systems with positive semi-definite
matrices. For example, the coordinate descent method requires a number of
iterations that grows linearly with $\mathrm{cond}(Q)\log\frac{1}{\epsilon}$ for obtaining an $\epsilon$-optimal solution.
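To give a feeling for the difference between a linear and a square-root dependence on $\mathrm{cond}(Q)$, the following sketch turns the two rates into iteration counts. It is only a rough illustration under our own assumptions: the helper names are hypothetical, the spectrum is made up, and the two counts bound different quantities (function values for (FGM) via the rate of Theorem 14, squared distance for (GM) via (45)).

```python
import math


def condition_number(eigenvalues, tol=1e-12):
    """cond(Q) = lambda_max / (smallest nonzero eigenvalue) for a PSD matrix Q."""
    nonzero = [lam for lam in eigenvalues if lam > tol]
    return max(nonzero) / min(nonzero)


def fgm_iterations(cond_q, eps):
    """Smallest k with 2 * (1 - 1/sqrt(cond))**k <= eps (relative accuracy in f)."""
    rate = 1.0 - 1.0 / math.sqrt(cond_q)
    if rate <= 0.0:
        return 1
    return math.ceil(math.log(eps / 2.0) / math.log(rate))


def gm_iterations(cond_q, eps):
    """Smallest k with ((1 - mu)/(1 + mu))**k <= eps, where mu = 1/cond."""
    mu = 1.0 / cond_q
    if mu >= 1.0:
        return 1
    return math.ceil(math.log(eps) / math.log((1.0 - mu) / (1.0 + mu)))


cond_q = condition_number([100.0, 4.0, 0.01, 0.0])  # singular Q: one zero eigenvalue
k_fgm = fgm_iterations(cond_q, 1e-6)
k_gm = gm_iterations(cond_q, 1e-6)
```

For this spectrum the condition number is about $10^4$; the (FGM) count is in the low thousands while the (GM) count is in the tens of thousands.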

Our results can be extended for solving general linear systems $Ax + b = 0$,
where $A \in \mathbb{R}^{m \times n}$. In this case we can formulate the equivalent unconstrained
optimization problem:

$$\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Ax + b\|^2,$$

which is a particular case of (38) and from Section 4 we can also conclude that
the objective function of this QP is in the class $q\mathcal{S}_{L_f,\kappa_f}$ with:

$$L_f = \sigma_{\max}^2(A) \;\; \text{and} \;\; \kappa_f = \sigma_{\min}^2(A) \;\; \Rightarrow \;\; \mu_f = \frac{\sigma_{\min}^2(A)}{\sigma_{\max}^2(A)},$$

where $\sigma_{\min}(A)$ denotes the smallest non-zero singular value of $A$ and $\sigma_{\max}(A)$
is the largest singular value of $A$. In this case the usual (FGM) algorithm
attains an $\epsilon$-optimal solution in a number of iterations of order $\frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}\log\frac{1}{\epsilon}$.


6.2 Dual of linearly constrained convex problems 

Let (P) be the dual formulation of a linearly constrained convex problem: 
$$\min_{u} \; \tilde{g}(u)$$

s.t. : c- A^uGK = W^ xK”L 

Then, the dual of this optimization problem can be written in the form of
structured problem (42), where $g$ is the convex conjugate of $\tilde{g}$. From duality
theory we know that $g$ is strongly convex and with Lipschitz gradient, provided
that $\tilde{g}$ is strongly convex and with Lipschitz gradient.


6.3 Lasso problem

The Lasso problem is defined as:

$$\min_{x : \, Cx \le d} \; \|Ax - b\|^2 + \lambda\|x\|_1.$$

Then, the Lasso problem is a particular case of the structured optimization
problem (42), provided that e.g. the feasible set of this problem is bounded
(polytope).


6.4 Linear programming 


Finding a primal-dual solution of a linear cone program can also be written
in the form of a structured optimization problem of the type studied in Section 4.
Indeed, let $c \in \mathbb{R}^N$, $b \in \mathbb{R}^m$ and $\mathcal{K} \subseteq \mathbb{R}^N$ be a closed convex cone, then we
define the linear cone programming:

$$\min_{u} \; \langle c, u\rangle \quad \text{s.t.} \;\; Eu = b, \;\; u \in \mathcal{K},$$ (67)

and its associated dual problem

$$\max_{v, s} \; \langle b, v\rangle \quad \text{s.t.} \;\; E^Tv + s = c, \;\; s \in \mathcal{K}^*,$$ (68)

where $\mathcal{K}^*$ denotes the dual cone. We assume that the pair of cone programs
(67)-(68) have optimal solutions and their associated duality gap is zero.
Therefore, a primal-dual solution of (67)-(68) can be found by solving the
following convex feasibility problem, also called homogeneous self-dual embedding:

$$\text{find} \;\; (u, v, s) \;\; \text{such that} \;\; E^Tv + s = c, \;\; Eu = b, \;\; \langle c, u\rangle = \langle b, v\rangle, \;\; u \in \mathcal{K}, \;\; s \in \mathcal{K}^*, \;\; v \in \mathbb{R}^m,$$ (69)

or, in a more compact formulation:

$$\text{find} \;\; x \;\; \text{such that} \;\; \bar{A}x = \bar{d}, \;\; x \in \bar{\mathcal{K}},$$

where

$$x = \begin{bmatrix} u \\ v \\ s \end{bmatrix}, \quad \bar{A} = \begin{bmatrix} 0 & E^T & I_N \\ E & 0 & 0 \\ c^T & -b^T & 0 \end{bmatrix}, \quad \bar{d} = \begin{bmatrix} c \\ b \\ 0 \end{bmatrix}, \quad \bar{\mathcal{K}} = \mathcal{K} \times \mathbb{R}^m \times \mathcal{K}^*.$$

The authors in [15] proposed solving conic optimization problems in homogeneous
self-dual embedding form using ADMM. In this paper we propose solving a
linear program in the homogeneous self-dual embedding form using the first
order methods presented above. A simple reformulation of this constrained
linear system as an optimization problem is:

$$\min_{x \in \bar{\mathcal{K}}} \; \|\bar{A}x - \bar{d}\|^2.$$ (70)

Denote the dimension of the variable $x$ as $n = 2N + m$. Let us note that the
optimization problem (70) is a particular case of the structured problems of
Section 4, with objective function of the form $f(x) = g(\bar{A}x)$, with
$g(\cdot) = \|\cdot - \bar{d}\|^2$. Moreover, the conditions of
Theorem 8 hold provided that $\mathcal{K} = \mathbb{R}^N_+$. We conclude that we can always solve
a linear program with a linear convergence rate using the first order methods
described in the present paper.
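The embedding data can be assembled mechanically from $(E, b, c)$. The following sketch is an illustration with hypothetical helper names and a made-up two-variable linear program; it builds the block matrix $\bar{A} = [0 \; E^T \; I_N; \; E \; 0 \; 0; \; c^T \; -b^T \; 0]$ and $\bar{d} = [c; b; 0]$ and checks that a known primal-dual optimal triple has zero residual.

```python
def self_dual_embedding(E, b, c):
    """Return (A_bar, d_bar) for x = (u, v, s), with d_bar = [c; b; 0]."""
    m, N = len(E), len(E[0])
    rows = []
    for i in range(N):                                   # rows of E^T v + s = c
        rows.append([0.0] * N
                    + [E[j][i] for j in range(m)]
                    + [1.0 if k == i else 0.0 for k in range(N)])
    for j in range(m):                                   # rows of E u = b
        rows.append(list(E[j]) + [0.0] * m + [0.0] * N)
    rows.append(list(c) + [-bj for bj in b] + [0.0] * N)  # <c, u> - <b, v> = 0
    return rows, list(c) + list(b) + [0.0]


def residual_sq(A_bar, d_bar, x):
    """Objective ||A_bar x - d_bar||^2 of the least-squares reformulation."""
    return sum((sum(a * xi for a, xi in zip(row, x)) - d) ** 2
               for row, d in zip(A_bar, d_bar))


# Tiny LP (made-up data): min u1 + 2*u2  s.t.  u1 + u2 = 1, u >= 0.
E, b, c = [[1.0, 1.0]], [1.0], [1.0, 2.0]
A_bar, d_bar = self_dual_embedding(E, b, c)
x_opt = [1.0, 0.0, 1.0, 0.0, 1.0]   # (u, v, s) = ((1, 0), 1, (0, 1))
```

Here $N = 2$ and $m = 1$, so `A_bar` has $N + m + 1 = 4$ rows and $n = 2N + m = 5$ columns, and `residual_sq(A_bar, d_bar, x_opt)` vanishes because strong duality holds for this program.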


6.5 Numerical simulations 

We test the performance of first order algorithms described above on randomly
generated Linear Programs (67) with $\mathcal{K} = \mathbb{R}^N_+$. We assume Linear Programs
with finite optimal values. Then, we can reformulate (67) as the quadratic
convex problem (70) for which $f^* = 0$. We compare the following algorithms
for problem (70) (the results are given in Figures 1 and 2):

1. Projected gradient algorithm with fixed stepsize (GM): $\alpha_k = \|\bar{A}\|^{-2}$ (in
this case the Lipschitz constant is $L_f = \|\bar{A}\|^2$).

2. Fast gradient algorithm with restart (R-FGM): where c = 10“^ and we 
restart when \\Ax^'=’^ — d|| < c\\Ax'^’^ — d||. 

3. Exact cyclic coordinate descent algorithm (Cyclic CD):

$$x_i^{k+1} = \arg\min_{x_i \in \bar{\mathcal{K}}_i} \; \|\bar{A}x_i(k) - \bar{d}\|^2,$$

where $\bar{\mathcal{K}}_i$ is either $\mathbb{R}_+$ or $\mathbb{R}$ and $x_i(k) = [x_1^{k+1} \cdots x_{i-1}^{k+1} \; x_i \; x_{i+1}^k \cdots x_n^k]$. It
has been proved in [5] that this algorithm is a particular version of the
feasible descent method (FDM) with parameters:

Qffc = 1, (3 = 1 + Lf^/n, L = min 

i 

provided that all the columns of $\bar{A}$ are nonzero, i.e. $\|\bar{A}_i\| > 0$ for all
$i = 1 : n$.
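The exact coordinate update has a closed form for a least-squares objective: minimize along coordinate $i$ and, when $x_i$ is restricted to $\mathbb{R}_+$, clip the result at zero. The sketch below is a minimal illustration of this idea; the function name `cyclic_cd`, the flag list `nonneg` and the small test system are our own assumptions, not the authors' code or data.

```python
def cyclic_cd(A, d, nonneg, sweeps):
    """Exact cyclic coordinate descent for min ||A x - d||^2, with x_i >= 0 where nonneg[i]."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    r = [-di for di in d]                                # residual r = A x - d, kept up to date
    col_sq = [sum(A[j][i] ** 2 for j in range(m)) for i in range(n)]  # must be nonzero
    history = [sum(rj * rj for rj in r)]
    for _ in range(sweeps):
        for i in range(n):
            # Unconstrained minimizer along coordinate i, clipped if x_i must be >= 0.
            step = -sum(A[j][i] * r[j] for j in range(m)) / col_sq[i]
            new_xi = x[i] + step
            if nonneg[i]:
                new_xi = max(0.0, new_xi)
            delta = new_xi - x[i]
            for j in range(m):
                r[j] += A[j][i] * delta
            x[i] = new_xi
        history.append(sum(rj * rj for rj in r))
    return x, history


# Consistent system with nonnegative solution (1, 2): A x = d.
A = [[2.0, 1.0], [1.0, 3.0]]
d = [4.0, 7.0]
x_cd, hist = cyclic_cd(A, d, nonneg=[True, True], sweeps=60)
```

Because each one-dimensional subproblem is a convex quadratic on an interval, clipping the unconstrained minimizer gives the exact coordinate minimizer, so the objective values in `hist` never increase.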

The comparisons use Linear Programs whose data $(E, b, c)$ are generated randomly
from the standard Gaussian distribution with full or sparse matrix $E$.
Matrix $E$ has 100 rows and 150 columns in the full case and 900 rows and
1000 columns in the sparse case. Figures 1 and 2 depict the error $\|\bar{A}x^k - \bar{d}\|$.
We can observe that the gradient method has a slower convergence than the
fast gradient method with restart, but both have a linear behaviour as we
can see from the comparison with the theoretical sublinear estimates, see Figure 1.
Moreover, the fast gradient method with restart is performing much
faster than the gradient or cyclic coordinate descent methods on sparse and
full Linear Programs, see Figure 2.